Introduction¶
Apart from applying the techniques in the EDA module,I will also develop a basic understanding of risk analytics in banking and financial services and understand how data is used to minimise the risk of losing money while lending to customers.nce the tendency to default.
Business Understanding¶
The loan providing companies find it hard to give loans to the people due to their insufficient or non-existent credit history. Because of that, some consumers use it to their advantage by becoming a defaulter. Suppose you work for a consumer finance company which specialises in lending various types of loans to urban customers. You have to use EDA to analyse the patterns present in the data. This will ensure that the applicants capable of repaying the loan are not rejected.
When the company receives a loan application, the company has to decide for loan approval based on the applicant’s profile. Two types of risks are associated with the bank’s decision:
If the applicant is likely to repay the loan, then not approving the loan results in a loss of business to the company
If the applicant is not likely to repay the loan, i.e. he/she is likely to default, then approving the loan may lead to a financial loss for the company.
The data given below contains the information about the loan application at the time of applying for the loan. It contains two types of scenarios:
The client with payment difficulties: he/she had late payment more than X days on at least one of the first Y instalments of the loan in our sample,
All other cases: All other cases when the payment is paid on time.
When a client applies for a loan, there are four types of decisions that could be taken by the client/company):
Approved: The Company has approved loan Application
Cancelled: The client cancelled the application sometime during approval. Either the client changed her/his mind about the loan or in some cases due to a higher risk of the client, he received worse pricing which he did not want.
Refused: The company had rejected the loan (because the client does not meet their requirements etc.).
Unused offer: Loan has been cancelled by the client but at different stages of the process.
In this case study, you will use EDA to understand how consumer attributes and loan attributes influence the tendency to default.
Business Objectives¶
This case study aims to identify patterns which indicate if a client has difficulty paying their instalments which may be used for taking actions such as denying the loan, reducing the amount of loan, lending (to risky applicants) at a higher interest rate, etc. This will ensure that the consumers capable of repaying the loan are not rejected. Identification of such applicants using EDA is the aim of this case study.
In other words, the company wants to understand the driving factors (or driver variables) behind loan default, i.e. the variables which are strong indicators of default. The company can utilise this knowledge for its portfolio and risk assessment.
To develop your understanding of the domain, you are advised to independently research a little about risk analytics - understanding the types of variables and their significance should be enough.
This dataset has 3 files as explained below:
'application_data.csv' contains all the information of the client at the time of application. The data is about whether a client has payment difficulties.
'previous_application.csv' contains information about the client’s previous loan data. It contains the data on whether the previous application had been Approved, Cancelled, Refused or Unused offer.
'columns_description.csv' is data dictionary which describes the meaning of the variables.
# Importing important libraries
import pandas as pd , numpy as np
import matplotlib.pyplot as plt
import seaborn as sns ; sns.set(color_codes = True)
import warnings
warnings.filterwarnings("ignore")
Reading a Files¶
Application Data¶
# Reading Application Dataset
application_data = pd.read_csv("application_data.csv")
# To Read All Rows And Columns
pd.set_option("display.max_columns",None)
#pd.set_option("display.max_rows",None)
# Fetching Firat Five Columns
application_data.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | OWN_CAR_AGE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_1 | EXT_SOURCE_2 | EXT_SOURCE_3 | APARTMENTS_AVG | BASEMENTAREA_AVG | YEARS_BEGINEXPLUATATION_AVG | YEARS_BUILD_AVG | COMMONAREA_AVG | ELEVATORS_AVG | ENTRANCES_AVG | FLOORSMAX_AVG | FLOORSMIN_AVG | LANDAREA_AVG | LIVINGAPARTMENTS_AVG | LIVINGAREA_AVG | NONLIVINGAPARTMENTS_AVG | NONLIVINGAREA_AVG | APARTMENTS_MODE | BASEMENTAREA_MODE | YEARS_BEGINEXPLUATATION_MODE | YEARS_BUILD_MODE | COMMONAREA_MODE | ELEVATORS_MODE | ENTRANCES_MODE | FLOORSMAX_MODE | FLOORSMIN_MODE | LANDAREA_MODE | LIVINGAPARTMENTS_MODE | LIVINGAREA_MODE | NONLIVINGAPARTMENTS_MODE | NONLIVINGAREA_MODE | APARTMENTS_MEDI | BASEMENTAREA_MEDI | YEARS_BEGINEXPLUATATION_MEDI | YEARS_BUILD_MEDI | COMMONAREA_MEDI | ELEVATORS_MEDI | ENTRANCES_MEDI | FLOORSMAX_MEDI | FLOORSMIN_MEDI | LANDAREA_MEDI | LIVINGAPARTMENTS_MEDI | LIVINGAREA_MEDI | NONLIVINGAPARTMENTS_MEDI | NONLIVINGAREA_MEDI | FONDKAPREMONT_MODE | HOUSETYPE_MODE | TOTALAREA_MODE | WALLSMATERIAL_MODE | EMERGENCYSTATE_MODE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.083037 | 0.262949 | 0.139376 | 0.0247 | 0.0369 | 0.9722 | 0.6192 | 0.0143 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0369 | 0.0202 | 0.0190 | 0.0000 | 0.0000 | 0.0252 | 0.0383 | 0.9722 | 0.6341 | 0.0144 | 0.0000 | 0.0690 | 0.0833 | 0.1250 | 0.0377 | 0.022 | 0.0198 | 0.0 | 0.0 | 0.0250 | 0.0369 | 0.9722 | 0.6243 | 0.0144 | 0.00 | 0.0690 | 0.0833 | 0.1250 | 0.0375 | 0.0205 | 0.0193 | 0.0000 | 0.00 | reg oper account | block of flats | 0.0149 | Stone, brick | No | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | NaN | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.311267 | 0.622246 | NaN | 0.0959 | 0.0529 | 0.9851 | 0.7960 | 0.0605 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0130 | 0.0773 | 0.0549 | 0.0039 | 0.0098 | 0.0924 | 0.0538 | 0.9851 | 0.8040 | 0.0497 | 0.0806 | 0.0345 | 0.2917 | 0.3333 | 0.0128 | 0.079 | 0.0554 | 0.0 | 0.0 | 0.0968 | 0.0529 | 0.9851 | 0.7987 | 0.0608 | 0.08 | 0.0345 | 0.2917 | 0.3333 | 0.0132 | 0.0787 | 0.0558 | 0.0039 | 0.01 | reg oper account | block of flats | 0.0714 | Block | No | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 26.0 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | NaN | 0.555912 | 0.729567 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | NaN | 0.650442 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | NaN | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | NaN | 0.322738 | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
#shape of the data
application_data.shape
(307511, 122)
# the info of all the columns
application_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Columns: 122 entries, SK_ID_CURR to AMT_REQ_CREDIT_BUREAU_YEAR dtypes: float64(65), int64(41), object(16) memory usage: 286.2+ MB
Data Cleaning On Application Data¶
Checking Application Data Null Values¶
lets check the missing value one by one
# shows how much percentage of data are null
(application_data.isnull().sum()/application_data.shape[0]*100).sort_values(ascending = False)
COMMONAREA_MEDI 69.872297
COMMONAREA_AVG 69.872297
COMMONAREA_MODE 69.872297
NONLIVINGAPARTMENTS_MODE 69.432963
NONLIVINGAPARTMENTS_AVG 69.432963
...
NAME_HOUSING_TYPE 0.000000
NAME_FAMILY_STATUS 0.000000
NAME_EDUCATION_TYPE 0.000000
NAME_INCOME_TYPE 0.000000
SK_ID_CURR 0.000000
Length: 122, dtype: float64
# Lets See the columns who has more than 40 % of data with Null values
cut_off=40
cols_to_drop=list(application_data.columns[100*application_data.isnull().mean()>cut_off])
print(cols_to_drop)
['OWN_CAR_AGE', 'EXT_SOURCE_1', 'APARTMENTS_AVG', 'BASEMENTAREA_AVG', 'YEARS_BEGINEXPLUATATION_AVG', 'YEARS_BUILD_AVG', 'COMMONAREA_AVG', 'ELEVATORS_AVG', 'ENTRANCES_AVG', 'FLOORSMAX_AVG', 'FLOORSMIN_AVG', 'LANDAREA_AVG', 'LIVINGAPARTMENTS_AVG', 'LIVINGAREA_AVG', 'NONLIVINGAPARTMENTS_AVG', 'NONLIVINGAREA_AVG', 'APARTMENTS_MODE', 'BASEMENTAREA_MODE', 'YEARS_BEGINEXPLUATATION_MODE', 'YEARS_BUILD_MODE', 'COMMONAREA_MODE', 'ELEVATORS_MODE', 'ENTRANCES_MODE', 'FLOORSMAX_MODE', 'FLOORSMIN_MODE', 'LANDAREA_MODE', 'LIVINGAPARTMENTS_MODE', 'LIVINGAREA_MODE', 'NONLIVINGAPARTMENTS_MODE', 'NONLIVINGAREA_MODE', 'APARTMENTS_MEDI', 'BASEMENTAREA_MEDI', 'YEARS_BEGINEXPLUATATION_MEDI', 'YEARS_BUILD_MEDI', 'COMMONAREA_MEDI', 'ELEVATORS_MEDI', 'ENTRANCES_MEDI', 'FLOORSMAX_MEDI', 'FLOORSMIN_MEDI', 'LANDAREA_MEDI', 'LIVINGAPARTMENTS_MEDI', 'LIVINGAREA_MEDI', 'NONLIVINGAPARTMENTS_MEDI', 'NONLIVINGAREA_MEDI', 'FONDKAPREMONT_MODE', 'HOUSETYPE_MODE', 'TOTALAREA_MODE', 'WALLSMATERIAL_MODE', 'EMERGENCYSTATE_MODE']
len(cols_to_drop)
49
As we can see there are 49 columns in application dataset which has more than 40% of data has NULL values, so we need to drop those columns are not required for analysis¶
application_data.drop(columns =cols_to_drop, axis=1, inplace = True )
application_data.shape
(307511, 73)
application_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 307511 entries, 0 to 307510 Data columns (total 73 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_CURR 307511 non-null int64 1 TARGET 307511 non-null int64 2 NAME_CONTRACT_TYPE 307511 non-null object 3 CODE_GENDER 307511 non-null object 4 FLAG_OWN_CAR 307511 non-null object 5 FLAG_OWN_REALTY 307511 non-null object 6 CNT_CHILDREN 307511 non-null int64 7 AMT_INCOME_TOTAL 307511 non-null float64 8 AMT_CREDIT 307511 non-null float64 9 AMT_ANNUITY 307499 non-null float64 10 AMT_GOODS_PRICE 307233 non-null float64 11 NAME_TYPE_SUITE 306219 non-null object 12 NAME_INCOME_TYPE 307511 non-null object 13 NAME_EDUCATION_TYPE 307511 non-null object 14 NAME_FAMILY_STATUS 307511 non-null object 15 NAME_HOUSING_TYPE 307511 non-null object 16 REGION_POPULATION_RELATIVE 307511 non-null float64 17 DAYS_BIRTH 307511 non-null int64 18 DAYS_EMPLOYED 307511 non-null int64 19 DAYS_REGISTRATION 307511 non-null float64 20 DAYS_ID_PUBLISH 307511 non-null int64 21 FLAG_MOBIL 307511 non-null int64 22 FLAG_EMP_PHONE 307511 non-null int64 23 FLAG_WORK_PHONE 307511 non-null int64 24 FLAG_CONT_MOBILE 307511 non-null int64 25 FLAG_PHONE 307511 non-null int64 26 FLAG_EMAIL 307511 non-null int64 27 OCCUPATION_TYPE 211120 non-null object 28 CNT_FAM_MEMBERS 307509 non-null float64 29 REGION_RATING_CLIENT 307511 non-null int64 30 REGION_RATING_CLIENT_W_CITY 307511 non-null int64 31 WEEKDAY_APPR_PROCESS_START 307511 non-null object 32 HOUR_APPR_PROCESS_START 307511 non-null int64 33 REG_REGION_NOT_LIVE_REGION 307511 non-null int64 34 REG_REGION_NOT_WORK_REGION 307511 non-null int64 35 LIVE_REGION_NOT_WORK_REGION 307511 non-null int64 36 REG_CITY_NOT_LIVE_CITY 307511 non-null int64 37 REG_CITY_NOT_WORK_CITY 307511 non-null int64 38 LIVE_CITY_NOT_WORK_CITY 307511 non-null int64 39 ORGANIZATION_TYPE 307511 non-null object 40 EXT_SOURCE_2 306851 non-null float64 41 EXT_SOURCE_3 246546 non-null float64 42 OBS_30_CNT_SOCIAL_CIRCLE 306490 non-null float64 43 DEF_30_CNT_SOCIAL_CIRCLE 306490 non-null float64 44 OBS_60_CNT_SOCIAL_CIRCLE 306490 non-null float64 45 DEF_60_CNT_SOCIAL_CIRCLE 306490 non-null float64 46 DAYS_LAST_PHONE_CHANGE 307510 non-null float64 47 FLAG_DOCUMENT_2 307511 non-null int64 48 FLAG_DOCUMENT_3 307511 non-null int64 49 FLAG_DOCUMENT_4 307511 non-null int64 50 FLAG_DOCUMENT_5 307511 non-null int64 51 FLAG_DOCUMENT_6 307511 non-null int64 52 FLAG_DOCUMENT_7 307511 non-null int64 53 FLAG_DOCUMENT_8 307511 non-null int64 54 FLAG_DOCUMENT_9 307511 non-null int64 55 FLAG_DOCUMENT_10 307511 non-null int64 56 FLAG_DOCUMENT_11 307511 non-null int64 57 FLAG_DOCUMENT_12 307511 non-null int64 58 FLAG_DOCUMENT_13 307511 non-null int64 59 FLAG_DOCUMENT_14 307511 non-null int64 60 FLAG_DOCUMENT_15 307511 non-null int64 61 FLAG_DOCUMENT_16 307511 non-null int64 62 FLAG_DOCUMENT_17 307511 non-null int64 63 FLAG_DOCUMENT_18 307511 non-null int64 64 FLAG_DOCUMENT_19 307511 non-null int64 65 FLAG_DOCUMENT_20 307511 non-null int64 66 FLAG_DOCUMENT_21 307511 non-null int64 67 AMT_REQ_CREDIT_BUREAU_HOUR 265992 non-null float64 68 AMT_REQ_CREDIT_BUREAU_DAY 265992 non-null float64 69 AMT_REQ_CREDIT_BUREAU_WEEK 265992 non-null float64 70 AMT_REQ_CREDIT_BUREAU_MON 265992 non-null float64 71 AMT_REQ_CREDIT_BUREAU_QRT 265992 non-null float64 72 AMT_REQ_CREDIT_BUREAU_YEAR 265992 non-null float64 dtypes: float64(20), int64(41), object(12) memory usage: 171.3+ MB
application_data.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | DAYS_LAST_PHONE_CHANGE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.262949 | 0.139376 | 2.0 | 2.0 | 2.0 | 2.0 | -1134.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | NaN | 1.0 | 0.0 | 1.0 | 0.0 | -828.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.555912 | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | -815.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650442 | NaN | 2.0 | 0.0 | 2.0 | 0.0 | -617.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.322738 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | -1106.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
DAYS_LAST_PHONE_CHANGE 0.000325 CNT_FAM_MEMBERS 0.000650 AMT_ANNUITY 0.003902 AMT_GOODS_PRICE 0.090403 EXT_SOURCE_2 0.214626 DEF_60_CNT_SOCIAL_CIRCLE 0.332021 OBS_60_CNT_SOCIAL_CIRCLE 0.332021 DEF_30_CNT_SOCIAL_CIRCLE 0.332021 OBS_30_CNT_SOCIAL_CIRCLE 0.332021 NAME_TYPE_SUITE 0.420148 AMT_REQ_CREDIT_BUREAU_QRT 13.501631 AMT_REQ_CREDIT_BUREAU_HOUR 13.501631 AMT_REQ_CREDIT_BUREAU_DAY 13.501631 AMT_REQ_CREDIT_BUREAU_WEEK 13.501631 AMT_REQ_CREDIT_BUREAU_MON 13.501631 AMT_REQ_CREDIT_BUREAU_YEAR 13.501631 EXT_SOURCE_3 19.825307 OCCUPATION_TYPE 31.345545 dtype: float64
Handling Null Columns¶
1. Columns analysis for handling null values For credit Bureau Enquiry¶
- AMT_REQ_CREDIT_BUREAU_HOUR
- AMT_REQ_CREDIT_BUREAU_DAY
- AMT_REQ_CREDIT_BUREAU_WEEK
- AMT_REQ_CREDIT_BUREAU_MON
- AMT_REQ_CREDIT_BUREAU_QRT
- AMT_REQ_CREDIT_BUREAU_YEAR
#Number of enquiries to Credit Bureau about the client before submitting the application
application_data[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']]
| AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 1 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | NaN | NaN | NaN | NaN | NaN | NaN |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 307506 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307507 | NaN | NaN | NaN | NaN | NaN | NaN |
| 307508 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 |
| 307509 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307510 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 1.0 |
307511 rows × 6 columns
application_data['AMT_REQ_CREDIT_BUREAU_HOUR'].describe()
count 265992.000000 mean 0.006402 std 0.083849 min 0.000000 25% 0.000000 50% 0.000000 75% 0.000000 max 4.000000 Name: AMT_REQ_CREDIT_BUREAU_HOUR, dtype: float64
application_data[['AMT_REQ_CREDIT_BUREAU_HOUR']].aggregate(['mean','median'])
| AMT_REQ_CREDIT_BUREAU_HOUR | |
|---|---|
| mean | 0.006402 |
| median | 0.000000 |
Going deep down on the credit enquiry I get to know why credit equiries has been done applying for the loan application
What is Credit Enquiry ? (research has been done from google)¶
A credit inquiry is a request for an individual's credit report information from a credit bureau. credit enquiry can be made by financial institutes, individuals, and Company. A credit inquiry is a credit check. Inquiries happen when there is a legally permitted request to see your credit report from a company or person.
The columns are inportant for the analysis as we can get to know already inquiry has been done from the credit bureau or not we can replace the NAN value with median in all these 6 Columns
fill_na = application_data[['AMT_REQ_CREDIT_BUREAU_HOUR','AMT_REQ_CREDIT_BUREAU_DAY','AMT_REQ_CREDIT_BUREAU_WEEK','AMT_REQ_CREDIT_BUREAU_MON','AMT_REQ_CREDIT_BUREAU_QRT','AMT_REQ_CREDIT_BUREAU_YEAR']]
for col in fill_na:
med = application_data[col].median()
application_data[col] = application_data[col].fillna(med)
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
DAYS_LAST_PHONE_CHANGE 0.000325 CNT_FAM_MEMBERS 0.000650 AMT_ANNUITY 0.003902 AMT_GOODS_PRICE 0.090403 EXT_SOURCE_2 0.214626 OBS_30_CNT_SOCIAL_CIRCLE 0.332021 DEF_30_CNT_SOCIAL_CIRCLE 0.332021 OBS_60_CNT_SOCIAL_CIRCLE 0.332021 DEF_60_CNT_SOCIAL_CIRCLE 0.332021 NAME_TYPE_SUITE 0.420148 EXT_SOURCE_3 19.825307 OCCUPATION_TYPE 31.345545 dtype: float64
2. Handling Null Values For Days Past Dus¶
- OBS_30_CNT_SOCIAL_CIRCLE
- DEF_30_CNT_SOCIAL_CIRCLE
- OBS_60_CNT_SOCIAL_CIRCLE
- DEF_60_CNT_SOCIAL_CIRCLECLE
Reserach is done from google
DPD is a significant indicator of the financial behaviour of the borrower. Late payment of the dues can affect the CIBIL report in more ways than one. DPD indicates the number of days by which the payment is delayed. In case the applicant has missed any payments in the past, the DPD section reports the details of the delay. It depicts how many days the delay was made and in which month did it happen.
application_data.loc[:,'OBS_30_CNT_SOCIAL_CIRCLE':'DEF_60_CNT_SOCIAL_CIRCLE']
| OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | |
|---|---|---|---|---|
| 0 | 2.0 | 2.0 | 2.0 | 2.0 |
| 1 | 1.0 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| 3 | 2.0 | 0.0 | 2.0 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
| 307506 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307507 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307508 | 6.0 | 0.0 | 6.0 | 0.0 |
| 307509 | 0.0 | 0.0 | 0.0 | 0.0 |
| 307510 | 0.0 | 0.0 | 0.0 | 0.0 |
307511 rows × 4 columns
application_data[application_data['OBS_30_CNT_SOCIAL_CIRCLE'] == 30.0].loc[:,'OBS_30_CNT_SOCIAL_CIRCLE':'DEF_60_CNT_SOCIAL_CIRCLE']
| OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | |
|---|---|---|---|---|
| 169517 | 30.0 | 0.0 | 30.0 | 0.0 |
| 280641 | 30.0 | 0.0 | 29.0 | 0.0 |
#pd.set_option("display.max_rows",None)
application_data[application_data['OBS_30_CNT_SOCIAL_CIRCLE'].isnull()].loc[:,'OBS_30_CNT_SOCIAL_CIRCLE':'DEF_60_CNT_SOCIAL_CIRCLE']
| OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | |
|---|---|---|---|---|
| 68 | NaN | NaN | NaN | NaN |
| 394 | NaN | NaN | NaN | NaN |
| 397 | NaN | NaN | NaN | NaN |
| 457 | NaN | NaN | NaN | NaN |
| 1042 | NaN | NaN | NaN | NaN |
| ... | ... | ... | ... | ... |
| 305526 | NaN | NaN | NaN | NaN |
| 305623 | NaN | NaN | NaN | NaN |
| 305641 | NaN | NaN | NaN | NaN |
| 305642 | NaN | NaN | NaN | NaN |
| 307402 | NaN | NaN | NaN | NaN |
1021 rows × 4 columns
As we can observe from the above two table
- First one shows that there are people who has taken the loan and have difficult to repay the loan amount with 30 past due days
- seconed one shows thta the nan is for all the four columns, means there no due days for any of the rows
so observing the data we can replace the NaN values with 0 , assuming the client has not done any default or observed in these 30 and 60 DPD time Frame
fill_na_2 = application_data[['OBS_30_CNT_SOCIAL_CIRCLE','DEF_30_CNT_SOCIAL_CIRCLE','OBS_60_CNT_SOCIAL_CIRCLE','DEF_60_CNT_SOCIAL_CIRCLE']]
for col in fill_na_2:
application_data[col].fillna(0,inplace=True)
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
DAYS_LAST_PHONE_CHANGE 0.000325 CNT_FAM_MEMBERS 0.000650 AMT_ANNUITY 0.003902 AMT_GOODS_PRICE 0.090403 EXT_SOURCE_2 0.214626 NAME_TYPE_SUITE 0.420148 EXT_SOURCE_3 19.825307 OCCUPATION_TYPE 31.345545 dtype: float64
3. DAYS_LAST_PHONE_CHANGE¶
By seeing the data set in DAYS_LAST_PHONE_CHANGE contains negative values which is not correct need to change in absolute and days cannot be negative fill Nan valus with mode
application_data[['DAYS_LAST_PHONE_CHANGE']]
| DAYS_LAST_PHONE_CHANGE | |
|---|---|
| 0 | -1134.0 |
| 1 | -828.0 |
| 2 | -815.0 |
| 3 | -617.0 |
| 4 | -1106.0 |
| ... | ... |
| 307506 | -273.0 |
| 307507 | 0.0 |
| 307508 | -1909.0 |
| 307509 | -322.0 |
| 307510 | -787.0 |
307511 rows × 1 columns
# converting from days to month first converting to abs value to make all number to positive
application_data['DAYS_LAST_PHONE_CHANGE'] = application_data['DAYS_LAST_PHONE_CHANGE'].abs()
application_data['DAYS_LAST_PHONE_CHANGE'].fillna(application_data['DAYS_LAST_PHONE_CHANGE'].mode()[0],inplace=True)
Lets Standardise DAYS_LAST_PHONE_CHANGE Change Days to year and change the data tyope float to int
application_data['DAYS_LAST_PHONE_CHANGE'] = application_data['DAYS_LAST_PHONE_CHANGE'].astype(int)
application_data['DAYS_LAST_PHONE_CHANGE'] = application_data['DAYS_LAST_PHONE_CHANGE'] //365
application_data['YEAR_LAST_PHONE_CHANGE'] = application_data['DAYS_LAST_PHONE_CHANGE']
application_data.drop(columns = 'DAYS_LAST_PHONE_CHANGE',axis=1, inplace=True)
application_data.shape
(307511, 73)
application_data.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.262949 | 0.139376 | 2.0 | 2.0 | 2.0 | 2.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2.0 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | NaN | 1.0 | 0.0 | 1.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1.0 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.555912 | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2.0 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650442 | NaN | 2.0 | 0.0 | 2.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1.0 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.322738 | NaN | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
4. CNT_FAM_MEMBERS¶
First of all count of family numbers is in float while seeing the data for count of family members where shoukd be whle numbers data type should be integer Fill the NaN values with median
application_data[['CNT_FAM_MEMBERS']]
| CNT_FAM_MEMBERS | |
|---|---|
| 0 | 1.0 |
| 1 | 2.0 |
| 2 | 1.0 |
| 3 | 2.0 |
| 4 | 1.0 |
| ... | ... |
| 307506 | 1.0 |
| 307507 | 1.0 |
| 307508 | 1.0 |
| 307509 | 2.0 |
| 307510 | 2.0 |
307511 rows × 1 columns
#fillinf na with family members
application_data["CNT_FAM_MEMBERS"] =application_data["CNT_FAM_MEMBERS"].replace(np.NaN, application_data["CNT_FAM_MEMBERS"].median())
application_data['CNT_FAM_MEMBERS'].isnull().sum()
0
changing datatype from float to int CNT_FAM_MEMBERS
application_data['CNT_FAM_MEMBERS'] = application_data['CNT_FAM_MEMBERS'].astype(int)
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
AMT_ANNUITY 0.003902 AMT_GOODS_PRICE 0.090403 EXT_SOURCE_2 0.214626 NAME_TYPE_SUITE 0.420148 EXT_SOURCE_3 19.825307 OCCUPATION_TYPE 31.345545 dtype: float64
5. EXT_SOURCE_2 and EXT_SOURCE_3¶
By seeing data of both the columns seems it can be an additional source for the credit score and can valuable for the analysis further. Dont know extacly what it is but we can remove the NaN values with mean or median
application_data[['EXT_SOURCE_2','EXT_SOURCE_3']]
| EXT_SOURCE_2 | EXT_SOURCE_3 | |
|---|---|---|
| 0 | 0.262949 | 0.139376 |
| 1 | 0.622246 | NaN |
| 2 | 0.555912 | 0.729567 |
| 3 | 0.650442 | NaN |
| 4 | 0.322738 | NaN |
| ... | ... | ... |
| 307506 | 0.681632 | NaN |
| 307507 | 0.115992 | NaN |
| 307508 | 0.535722 | 0.218859 |
| 307509 | 0.514163 | 0.661024 |
| 307510 | 0.708569 | 0.113922 |
307511 rows × 2 columns
application_data[['EXT_SOURCE_2','EXT_SOURCE_3']].aggregate(['mean','median'])
| EXT_SOURCE_2 | EXT_SOURCE_3 | |
|---|---|---|
| mean | 0.514393 | 0.510853 |
| median | 0.565961 | 0.535276 |
application_data[['EXT_SOURCE_2','EXT_SOURCE_3']].describe()
| EXT_SOURCE_2 | EXT_SOURCE_3 | |
|---|---|---|
| count | 3.068510e+05 | 246546.000000 |
| mean | 5.143927e-01 | 0.510853 |
| std | 1.910602e-01 | 0.194844 |
| min | 8.173617e-08 | 0.000527 |
| 25% | 3.924574e-01 | 0.370650 |
| 50% | 5.659614e-01 | 0.535276 |
| 75% | 6.636171e-01 | 0.669057 |
| max | 8.549997e-01 | 0.896010 |
As we can see the description above there is no major difference between mean and max we can either go for mean or we can for median. it is better to change the NaN value with median
application_data['EXT_SOURCE_2'].fillna(application_data['EXT_SOURCE_2'].median(), inplace=True)
application_data['EXT_SOURCE_3'].fillna(application_data['EXT_SOURCE_3'].median(), inplace=True)
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
AMT_ANNUITY 0.003902 AMT_GOODS_PRICE 0.090403 NAME_TYPE_SUITE 0.420148 OCCUPATION_TYPE 31.345545 dtype: float64
5. AMT_ANNUITY -- Numerical Variable¶
application_data[['AMT_ANNUITY']]
| AMT_ANNUITY | |
|---|---|
| 0 | 24700.5 |
| 1 | 35698.5 |
| 2 | 6750.0 |
| 3 | 29686.5 |
| 4 | 21865.5 |
| ... | ... |
| 307506 | 27558.0 |
| 307507 | 12001.5 |
| 307508 | 29979.0 |
| 307509 | 20205.0 |
| 307510 | 49117.5 |
307511 rows × 1 columns
application_data[['AMT_ANNUITY']].describe()
| AMT_ANNUITY | |
|---|---|
| count | 307499.000000 |
| mean | 27108.573909 |
| std | 14493.737315 |
| min | 1615.500000 |
| 25% | 16524.000000 |
| 50% | 24903.000000 |
| 75% | 34596.000000 |
| max | 258025.500000 |
sns.boxplot(application_data['AMT_ANNUITY'])
<Axes: ylabel='AMT_ANNUITY'>
As we can see both description and boxplot visualization there are high number of outliers , so we can fill the Nan values with median
application_data['AMT_ANNUITY'].fillna(application_data['AMT_ANNUITY'].median(),inplace=True)
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
AMT_GOODS_PRICE 0.090403 NAME_TYPE_SUITE 0.420148 OCCUPATION_TYPE 31.345545 dtype: float64
6. AMT_GOODS_PRICE -- Numerical Variable¶
application_data[['AMT_GOODS_PRICE']]
| AMT_GOODS_PRICE | |
|---|---|
| 0 | 351000.0 |
| 1 | 1129500.0 |
| 2 | 135000.0 |
| 3 | 297000.0 |
| 4 | 513000.0 |
| ... | ... |
| 307506 | 225000.0 |
| 307507 | 225000.0 |
| 307508 | 585000.0 |
| 307509 | 319500.0 |
| 307510 | 675000.0 |
307511 rows × 1 columns
application_data['AMT_GOODS_PRICE'].describe()
count 3.072330e+05 mean 5.383962e+05 std 3.694465e+05 min 4.050000e+04 25% 2.385000e+05 50% 4.500000e+05 75% 6.795000e+05 max 4.050000e+06 Name: AMT_GOODS_PRICE, dtype: float64
sns.boxplot(application_data['AMT_GOODS_PRICE'])
<Axes: ylabel='AMT_GOODS_PRICE'>
As we can see both description and boxplot visualization there are high number of outliers , so we can fill the Nan values with median
application_data['AMT_GOODS_PRICE'].fillna(application_data['AMT_GOODS_PRICE'].median(),inplace=True)
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
NAME_TYPE_SUITE 0.420148 OCCUPATION_TYPE 31.345545 dtype: float64
7. NAME_TYPE_SUITE -- Categorical Variable¶
application_data[['NAME_TYPE_SUITE']]
| NAME_TYPE_SUITE | |
|---|---|
| 0 | Unaccompanied |
| 1 | Family |
| 2 | Unaccompanied |
| 3 | Unaccompanied |
| 4 | Unaccompanied |
| ... | ... |
| 307506 | Unaccompanied |
| 307507 | Unaccompanied |
| 307508 | Unaccompanied |
| 307509 | Unaccompanied |
| 307510 | Unaccompanied |
307511 rows × 1 columns
application_data['NAME_TYPE_SUITE'].value_counts()
NAME_TYPE_SUITE Unaccompanied 248526 Family 40149 Spouse, partner 11370 Children 3267 Other_B 1770 Other_A 866 Group of people 271 Name: count, dtype: int64
# Most occcured
application_data['NAME_TYPE_SUITE'].mode()[0]
'Unaccompanied'
# since ita an categorical variable flling NaN values with mode
application_data['NAME_TYPE_SUITE'].fillna(application_data['NAME_TYPE_SUITE'].mode()[0],inplace=True)
8. OCCUPATION_TYPE -- Categorical Variable¶
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = application_data.count()/len(application_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
OCCUPATION_TYPE 31.345545 dtype: float64
application_data['OCCUPATION_TYPE'].value_counts()
OCCUPATION_TYPE Laborers 55186 Sales staff 32102 Core staff 27570 Managers 21371 Drivers 18603 High skill tech staff 11380 Accountants 9813 Medicine staff 8537 Security staff 6721 Cooking staff 5946 Cleaning staff 4653 Private service staff 2652 Low-skill Laborers 2093 Waiters/barmen staff 1348 Secretaries 1305 Realty agents 751 HR staff 563 IT staff 526 Name: count, dtype: int64
# As we can see OCCUPATION_TYPE has 31 % and most of them are laboerers it will not be good approach to fill missing values with mode of that column
sns.countplot(application_data['OCCUPATION_TYPE'])
<Axes: xlabel='count', ylabel='OCCUPATION_TYPE'>
# Lets fill the missing values with others as it will good for the analysis
application_data['OCCUPATION_TYPE'].fillna('Others',inplace=True)
application_data.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | DAYS_BIRTH | DAYS_EMPLOYED | DAYS_REGISTRATION | DAYS_ID_PUBLISH | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.018801 | -9461 | -637 | -3648.0 | -2120 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.262949 | 0.139376 | 2.0 | 2.0 | 2.0 | 2.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House / apartment | 0.003541 | -16765 | -1188 | -1186.0 | -291 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | 0.535276 | 1.0 | 0.0 | 1.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.010032 | -19046 | -225 | -4260.0 | -2531 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.555912 | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary / secondary special | Civil marriage | House / apartment | 0.008019 | -19005 | -3039 | -9833.0 | -2437 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650442 | 0.535276 | 2.0 | 0.0 | 2.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary / secondary special | Single / not married | House / apartment | 0.028663 | -19932 | -3038 | -4311.0 | -3458 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.322738 | 0.535276 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 |
There are some more anomalies in data i.e. Days columns seems to be negative values and days cannot be in negative to need to change it into abs then we can standardise day into Year
for col in application_data:
if col.startswith("DAYS"):
print(col)
DAYS_BIRTH DAYS_EMPLOYED DAYS_REGISTRATION DAYS_ID_PUBLISH
# now leats chage data type of DAYS_REGISTRATION and DAYS_LAST_PHONE_CHANGE float to integer
application_data['YEAR_REGISTRATION'] = application_data['DAYS_REGISTRATION'].astype(int)
application_data['DAYS_BIRTH'] = application_data['DAYS_BIRTH'].abs()
application_data['DAYS_EMPLOYED'] = application_data['DAYS_EMPLOYED'].abs()
application_data['YEAR_REGISTRATION'] = application_data['YEAR_REGISTRATION'].abs()
application_data['DAYS_ID_PUBLISH'] = application_data['DAYS_ID_PUBLISH'].abs()
Lets add new column to change the day_birth as Age and DAYS_EMPLOYED, DAYS_REGISTRATION, DAYS_ID_PUBLISH to Number of Years
application_data['AGE'] = application_data['DAYS_BIRTH']//365
application_data['YEAR_EMPLOYED'] = application_data['DAYS_EMPLOYED']//365
application_data['YEAR_REGISTRATION'] = application_data['YEAR_REGISTRATION']//365
application_data['YEAR_ID_PUBLISH'] = application_data['DAYS_ID_PUBLISH']//365
Note: All days column are now considered as number of years¶
Lets Drop all Days Columns
application_data.drop(columns = "DAYS_BIRTH",axis=1,inplace=True)
application_data.drop(columns = "DAYS_EMPLOYED",axis=1,inplace=True)
application_data.drop(columns = "DAYS_REGISTRATION",axis=1,inplace=True)
application_data.drop(columns = "DAYS_ID_PUBLISH",axis=1,inplace=True)
application_data.shape
(307511, 73)
application_data['NAME_EDUCATION_TYPE'].unique()
array(['Secondary / secondary special', 'Higher education',
'Incomplete higher', 'Lower secondary', 'Academic degree'],
dtype=object)
application_data['NAME_FAMILY_STATUS'].unique()
array(['Single / not married', 'Married', 'Civil marriage', 'Widow',
'Separated', 'Unknown'], dtype=object)
application_data['NAME_HOUSING_TYPE'].unique()
array(['House / apartment', 'Rented apartment', 'With parents',
'Municipal apartment', 'Office apartment', 'Co-op apartment'],
dtype=object)
application_data['ORGANIZATION_TYPE'].unique()
array(['Business Entity Type 3', 'School', 'Government', 'Religion',
'Other', 'XNA', 'Electricity', 'Medicine',
'Business Entity Type 2', 'Self-employed', 'Transport: type 2',
'Construction', 'Housing', 'Kindergarten', 'Trade: type 7',
'Industry: type 11', 'Military', 'Services', 'Security Ministries',
'Transport: type 4', 'Industry: type 1', 'Emergency', 'Security',
'Trade: type 2', 'University', 'Transport: type 3', 'Police',
'Business Entity Type 1', 'Postal', 'Industry: type 4',
'Agriculture', 'Restaurant', 'Culture', 'Hotel',
'Industry: type 7', 'Trade: type 3', 'Industry: type 3', 'Bank',
'Industry: type 9', 'Insurance', 'Trade: type 6',
'Industry: type 2', 'Transport: type 1', 'Industry: type 12',
'Mobile', 'Trade: type 1', 'Industry: type 5', 'Industry: type 10',
'Legal Services', 'Advertising', 'Trade: type 5', 'Cleaning',
'Industry: type 13', 'Trade: type 4', 'Telecom',
'Industry: type 8', 'Realtor', 'Industry: type 6'], dtype=object)
As we can see above NAME_EDUCATION_TYPE has education Secondary / secondary special we can give it to one number Secondary , NAME_FAMILY_STATUS has Single / not married we can give Single and NAME_HOUSING_TYPE has House / apartment we can one name as House
application_data['NAME_EDUCATION_TYPE'].replace("Secondary / secondary special","Secondary",inplace=True)
application_data[['NAME_EDUCATION_TYPE']]
| NAME_EDUCATION_TYPE | |
|---|---|
| 0 | Secondary |
| 1 | Higher education |
| 2 | Secondary |
| 3 | Secondary |
| 4 | Secondary |
| ... | ... |
| 307506 | Secondary |
| 307507 | Secondary |
| 307508 | Higher education |
| 307509 | Secondary |
| 307510 | Higher education |
307511 rows × 1 columns
application_data['NAME_FAMILY_STATUS'].replace("Single / not married","Single",inplace=True)
application_data['NAME_HOUSING_TYPE'].replace("House / apartment","House",inplace=True)
application_data.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | YEAR_REGISTRATION | AGE | YEAR_EMPLOYED | YEAR_ID_PUBLISH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary | Single | House | 0.018801 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.262949 | 0.139376 | 2.0 | 2.0 | 2.0 | 2.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 9 | 25 | 1 | 5 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House | 0.003541 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | 0.535276 | 1.0 | 0.0 | 1.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 3 | 45 | 3 | 0 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary | Single | House | 0.010032 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.555912 | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 11 | 52 | 0 | 6 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary | Civil marriage | House | 0.008019 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650442 | 0.535276 | 2.0 | 0.0 | 2.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 26 | 52 | 8 | 6 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary | Single | House | 0.028663 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.322738 | 0.535276 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 | 11 | 54 | 8 | 9 |
application_data[application_data['ORGANIZATION_TYPE'] == 'XNA']
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | FLAG_DOCUMENT_2 | FLAG_DOCUMENT_3 | FLAG_DOCUMENT_4 | FLAG_DOCUMENT_5 | FLAG_DOCUMENT_6 | FLAG_DOCUMENT_7 | FLAG_DOCUMENT_8 | FLAG_DOCUMENT_9 | FLAG_DOCUMENT_10 | FLAG_DOCUMENT_11 | FLAG_DOCUMENT_12 | FLAG_DOCUMENT_13 | FLAG_DOCUMENT_14 | FLAG_DOCUMENT_15 | FLAG_DOCUMENT_16 | FLAG_DOCUMENT_17 | FLAG_DOCUMENT_18 | FLAG_DOCUMENT_19 | FLAG_DOCUMENT_20 | FLAG_DOCUMENT_21 | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | YEAR_REGISTRATION | AGE | YEAR_EMPLOYED | YEAR_ID_PUBLISH | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 8 | 100011 | 0 | Cash loans | F | N | Y | 0 | 112500.000 | 1019610.0 | 33826.5 | 913500.0 | Children | Pensioner | Secondary | Married | House | 0.018634 | 1 | 0 | 0 | 1 | 0 | 0 | Others | 2 | 2 | 2 | WEDNESDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.205747 | 0.751724 | 1.0 | 0.0 | 1.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 20 | 55 | 1000 | 9 |
| 11 | 100015 | 0 | Cash loans | F | N | Y | 0 | 38419.155 | 148365.0 | 10678.5 | 135000.0 | Children | Pensioner | Secondary | Married | House | 0.015221 | 1 | 0 | 0 | 1 | 1 | 0 | Others | 2 | 2 | 2 | FRIDAY | 7 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.555183 | 0.652897 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 6 | 14 | 55 | 1000 | 6 |
| 23 | 100027 | 0 | Cash loans | F | N | Y | 0 | 83250.000 | 239850.0 | 23850.0 | 225000.0 | Unaccompanied | Pensioner | Secondary | Married | House | 0.006296 | 1 | 0 | 0 | 1 | 1 | 0 | Others | 2 | 3 | 3 | FRIDAY | 12 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.624305 | 0.669057 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 2 | 24 | 68 | 1000 | 10 |
| 38 | 100045 | 0 | Cash loans | F | N | Y | 0 | 99000.000 | 247275.0 | 17338.5 | 225000.0 | Unaccompanied | Pensioner | Secondary | Married | House | 0.006207 | 1 | 0 | 0 | 1 | 1 | 0 | Others | 2 | 2 | 2 | FRIDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.650765 | 0.751724 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0 | 26 | 65 | 1000 | 13 |
| 43 | 100050 | 0 | Cash loans | F | N | Y | 0 | 108000.000 | 746280.0 | 42970.5 | 675000.0 | Unaccompanied | Pensioner | Higher education | Single | House | 0.010966 | 1 | 0 | 0 | 1 | 0 | 0 | Others | 1 | 2 | 2 | WEDNESDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.766138 | 0.684828 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3.0 | 1 | 15 | 64 | 1000 | 12 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 307469 | 456209 | 0 | Cash loans | F | N | Y | 0 | 202500.000 | 703728.0 | 29943.0 | 607500.0 | Unaccompanied | Pensioner | Secondary | Single | House | 0.031329 | 1 | 0 | 0 | 1 | 0 | 0 | Others | 1 | 2 | 2 | MONDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.195625 | 0.360613 | 11.0 | 2.0 | 11.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 4 | 32 | 61 | 1000 | 11 |
| 307483 | 456227 | 0 | Cash loans | F | N | Y | 0 | 99000.000 | 247275.0 | 16479.0 | 225000.0 | Unaccompanied | Pensioner | Secondary | Separated | House | 0.007330 | 1 | 0 | 0 | 1 | 0 | 0 | Others | 1 | 2 | 2 | MONDAY | 12 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.589906 | 0.520898 | 1.0 | 1.0 | 1.0 | 1.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 5.0 | 4 | 30 | 68 | 1000 | 11 |
| 307487 | 456231 | 0 | Cash loans | M | N | Y | 0 | 117000.000 | 1071909.0 | 31473.0 | 936000.0 | Unaccompanied | Pensioner | Secondary | Married | House | 0.010147 | 1 | 0 | 0 | 1 | 0 | 0 | Others | 2 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.307082 | 0.255332 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 8.0 | 2 | 15 | 63 | 1000 | 11 |
| 307505 | 456249 | 0 | Cash loans | F | N | Y | 0 | 112500.000 | 225000.0 | 22050.0 | 225000.0 | Unaccompanied | Pensioner | Secondary | Single | House | 0.022800 | 1 | 0 | 0 | 1 | 1 | 0 | Others | 1 | 2 | 2 | MONDAY | 15 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.346391 | 0.742182 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0 | 20 | 66 | 1000 | 6 |
| 307507 | 456252 | 0 | Cash loans | F | N | Y | 0 | 72000.000 | 269550.0 | 12001.5 | 225000.0 | Unaccompanied | Pensioner | Secondary | Widow | House | 0.025164 | 1 | 0 | 0 | 1 | 1 | 0 | Others | 1 | 2 | 2 | MONDAY | 8 | 0 | 0 | 0 | 0 | 0 | 0 | XNA | 0.115992 | 0.535276 | 0.0 | 0.0 | 0.0 | 0.0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 12 | 56 | 1000 | 11 |
55374 rows × 73 columns
As seen above data I have seen that NAME_INCOME_TYPE who are pensioner there ORGANIZATION_TYEP is XNA lets make another organization type as Retired as well
application_data['ORGANIZATION_TYPE'].replace("XNA",'Retired', inplace=True)
application_data['ORGANIZATION_TYPE'].isnull().sum()
0
There in Flag Documents Columns Which tells Us how many Documnets client submitted lets sum up all the columns and put it in a one columns i.e. All_Flag_document and we drop all the flag_document columns
for col in application_data:
if col.startswith("FLAG_DOCUMENT"):
print(col)
FLAG_DOCUMENT_2 FLAG_DOCUMENT_3 FLAG_DOCUMENT_4 FLAG_DOCUMENT_5 FLAG_DOCUMENT_6 FLAG_DOCUMENT_7 FLAG_DOCUMENT_8 FLAG_DOCUMENT_9 FLAG_DOCUMENT_10 FLAG_DOCUMENT_11 FLAG_DOCUMENT_12 FLAG_DOCUMENT_13 FLAG_DOCUMENT_14 FLAG_DOCUMENT_15 FLAG_DOCUMENT_16 FLAG_DOCUMENT_17 FLAG_DOCUMENT_18 FLAG_DOCUMENT_19 FLAG_DOCUMENT_20 FLAG_DOCUMENT_21
All_Document = ["FLAG_DOCUMENT_2", "FLAG_DOCUMENT_3","FLAG_DOCUMENT_4","FLAG_DOCUMENT_5","FLAG_DOCUMENT_6","FLAG_DOCUMENT_7","FLAG_DOCUMENT_8","FLAG_DOCUMENT_9","FLAG_DOCUMENT_10",
"FLAG_DOCUMENT_11","FLAG_DOCUMENT_12","FLAG_DOCUMENT_13","FLAG_DOCUMENT_14","FLAG_DOCUMENT_15","FLAG_DOCUMENT_16","FLAG_DOCUMENT_17","FLAG_DOCUMENT_18",
"FLAG_DOCUMENT_19","FLAG_DOCUMENT_20","FLAG_DOCUMENT_21"]
application_data['All_Flag_document'] = application_data[All_Document].sum(axis= 1)
application_data.drop(All_Document, axis=1,inplace = True)
application_data.shape
(307511, 54)
application_data.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | YEAR_REGISTRATION | AGE | YEAR_EMPLOYED | YEAR_ID_PUBLISH | All_Flag_document | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary | Single | House | 0.018801 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.262949 | 0.139376 | 2.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 9 | 25 | 1 | 5 | 1 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House | 0.003541 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | 0.535276 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 3 | 45 | 3 | 0 | 1 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary | Single | House | 0.010032 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.555912 | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 11 | 52 | 0 | 6 | 0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary | Civil marriage | House | 0.008019 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650442 | 0.535276 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 26 | 52 | 8 | 6 | 1 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary | Single | House | 0.028663 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.322738 | 0.535276 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 | 11 | 54 | 8 | 9 | 1 |
Segragatting Numericqal and Categorical Columns¶
Cat_col = list(set(application_data.columns) - set(application_data.describe().columns))
Num_col = application_data.describe().columns
application_data[Cat_col].head()
| OCCUPATION_TYPE | NAME_EDUCATION_TYPE | CODE_GENDER | NAME_TYPE_SUITE | FLAG_OWN_REALTY | NAME_HOUSING_TYPE | WEEKDAY_APPR_PROCESS_START | ORGANIZATION_TYPE | NAME_INCOME_TYPE | NAME_FAMILY_STATUS | FLAG_OWN_CAR | NAME_CONTRACT_TYPE | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Laborers | Secondary | M | Unaccompanied | Y | House | WEDNESDAY | Business Entity Type 3 | Working | Single | N | Cash loans |
| 1 | Core staff | Higher education | F | Family | N | House | MONDAY | School | State servant | Married | N | Cash loans |
| 2 | Laborers | Secondary | M | Unaccompanied | Y | House | MONDAY | Government | Working | Single | Y | Revolving loans |
| 3 | Laborers | Secondary | F | Unaccompanied | Y | House | WEDNESDAY | Business Entity Type 3 | Working | Civil marriage | N | Cash loans |
| 4 | Core staff | Secondary | M | Unaccompanied | Y | House | THURSDAY | Religion | Working | Single | N | Cash loans |
application_data[Num_col].head()
| SK_ID_CURR | TARGET | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | REGION_POPULATION_RELATIVE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | YEAR_REGISTRATION | AGE | YEAR_EMPLOYED | YEAR_ID_PUBLISH | All_Flag_document | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | 0.018801 | 1 | 1 | 0 | 1 | 1 | 0 | 1 | 2 | 2 | 10 | 0 | 0 | 0 | 0 | 0 | 0 | 0.262949 | 0.139376 | 2.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 9 | 25 | 1 | 5 | 1 |
| 1 | 100003 | 0 | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | 0.003541 | 1 | 1 | 0 | 1 | 1 | 0 | 2 | 1 | 1 | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0.622246 | 0.535276 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 3 | 45 | 3 | 0 | 1 |
| 2 | 100004 | 0 | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | 0.010032 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | 2 | 2 | 9 | 0 | 0 | 0 | 0 | 0 | 0 | 0.555912 | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 11 | 52 | 0 | 6 | 0 |
| 3 | 100006 | 0 | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | 0.008019 | 1 | 1 | 0 | 1 | 0 | 0 | 2 | 2 | 2 | 17 | 0 | 0 | 0 | 0 | 0 | 0 | 0.650442 | 0.535276 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 26 | 52 | 8 | 6 | 1 |
| 4 | 100007 | 0 | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | 0.028663 | 1 | 1 | 0 | 1 | 0 | 0 | 1 | 2 | 2 | 11 | 0 | 0 | 0 | 0 | 1 | 1 | 0.322738 | 0.535276 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 | 11 | 54 | 8 | 9 | 1 |
Handling Outlier For Application Data¶
- AMT_INCOME_TOTAL
- AMT_CREDIT
- AMT_ANNUITY
- AMT_GOODS_PRICE
- EXT_SOURCE_2
- EXT_SOURCE_3
- AGE
1. AMT_INCOME_TOTAL¶
application_data['AMT_INCOME_TOTAL'].describe()
count 3.075110e+05 mean 1.687979e+05 std 2.371231e+05 min 2.565000e+04 25% 1.125000e+05 50% 1.471500e+05 75% 2.025000e+05 max 1.170000e+08 Name: AMT_INCOME_TOTAL, dtype: float64
sns.boxplot(application_data.AMT_INCOME_TOTAL)
<Axes: ylabel='AMT_INCOME_TOTAL'>
#print the quantile (0.5, 0.7, 0.9, 0.95 and 0.99) of Amt Income Total variable
application_data.AMT_INCOME_TOTAL.quantile([0.5,0.7,0.95,0.99])
0.50 147150.0 0.70 180000.0 0.95 337500.0 0.99 472500.0 Name: AMT_INCOME_TOTAL, dtype: float64
application_data['AMT_INCOME_TOTAL'].max()
117000000.0
application_data[application_data['AMT_INCOME_TOTAL'] == 117000000.0]
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | YEAR_REGISTRATION | AGE | YEAR_EMPLOYED | YEAR_ID_PUBLISH | All_Flag_document | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 12840 | 114967 | 1 | Cash loans | F | N | Y | 1 | 117000000.0 | 562491.0 | 26194.5 | 454500.0 | Unaccompanied | Working | Secondary | Married | House | 0.010643 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 3 | 2 | 2 | TUESDAY | 14 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.113161 | 0.145543 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0 | 18 | 34 | 2 | 9 | 1 |
As we can see above the person income is 117000000.0 and have 2 years of employed where age is 34 this i think wont be possible where occuoation type is laboers definately this is an outlier. so we need to remove an outlier for amt_income_total
Q1 = application_data['AMT_INCOME_TOTAL'].quantile(0.25)
Q3 = application_data['AMT_INCOME_TOTAL'].quantile(0.75)
IQR = Q3 - Q1
lower_range = Q1-1.5*IQR
higher_range = Q3+1.5*IQR
application_data = application_data[(application_data['AMT_INCOME_TOTAL'] > lower_range) & (application_data['AMT_INCOME_TOTAL'] < higher_range)]
application_data.shape
(291686, 54)
application_data.shape
(291686, 54)
2. AMT_CREDIT¶
sns.boxplot(application_data.AMT_CREDIT)
<Axes: ylabel='AMT_CREDIT'>
Q1 = application_data['AMT_CREDIT'].quantile(0.25)
Q3 = application_data['AMT_CREDIT'].quantile(0.75)
IQR = Q3 - Q1
lower_range = Q1-1.5*IQR
higher_range = Q3+1.5*IQR
application_data = application_data[(application_data['AMT_CREDIT'] > lower_range) & (application_data['AMT_CREDIT'] < higher_range)]
application_data.shape
(286024, 54)
2. AMT_ANNUITY¶
sns.boxplot(application_data['AMT_ANNUITY'])
<Axes: ylabel='AMT_ANNUITY'>
There are outliers in AMT_ANNUITY but we cannot drop these columns because amt annuity is the payment made by the client on the regular basis
sns.boxplot(application_data['AMT_GOODS_PRICE'])
<Axes: ylabel='AMT_GOODS_PRICE'>
There are outliers in AMT_GOODS_PRICE but we cannot drop these columns because AMT_GOODS_PRICE is the price of goods for which client has taken the loan
3. EXT_SOURCE_2¶
sns.boxplot(application_data['EXT_SOURCE_2'])
<Axes: ylabel='EXT_SOURCE_2'>
There are not single putlier found in EXT_SOURCE_2 column
sns.boxplot(application_data['EXT_SOURCE_3'])
<Axes: ylabel='EXT_SOURCE_3'>
Q1 = application_data['EXT_SOURCE_3'].quantile(0.25)
Q3 = application_data['EXT_SOURCE_3'].quantile(0.75)
IQR = Q3 - Q1
lower_range = Q1-1.5*IQR
higher_range = Q3+1.5*IQR
application_data = application_data[(application_data['EXT_SOURCE_3'] > lower_range) & (application_data['EXT_SOURCE_3'] < higher_range)]
sns.boxplot(application_data['EXT_SOURCE_3'])
<Axes: ylabel='EXT_SOURCE_3'>
application_data.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE | CODE_GENDER | FLAG_OWN_CAR | FLAG_OWN_REALTY | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT | AMT_ANNUITY | AMT_GOODS_PRICE | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | FLAG_MOBIL | FLAG_EMP_PHONE | FLAG_WORK_PHONE | FLAG_CONT_MOBILE | FLAG_PHONE | FLAG_EMAIL | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | EXT_SOURCE_2 | EXT_SOURCE_3 | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | YEAR_REGISTRATION | AGE | YEAR_EMPLOYED | YEAR_ID_PUBLISH | All_Flag_document | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | N | Y | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary | Single | House | 0.018801 | 1 | 1 | 0 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.262949 | 0.139376 | 2.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 9 | 25 | 1 | 5 | 1 |
| 1 | 100003 | 0 | Cash loans | F | N | N | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House | 0.003541 | 1 | 1 | 0 | 1 | 1 | 0 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 0.622246 | 0.535276 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 3 | 45 | 3 | 0 | 1 |
| 2 | 100004 | 0 | Revolving loans | M | Y | Y | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary | Single | House | 0.010032 | 1 | 1 | 1 | 1 | 1 | 0 | Laborers | 1 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.555912 | 0.729567 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 11 | 52 | 0 | 6 | 0 |
| 3 | 100006 | 0 | Cash loans | F | N | Y | 0 | 135000.0 | 312682.5 | 29686.5 | 297000.0 | Unaccompanied | Working | Secondary | Civil marriage | House | 0.008019 | 1 | 1 | 0 | 1 | 0 | 0 | Laborers | 2 | 2 | 2 | WEDNESDAY | 17 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 0.650442 | 0.535276 | 2.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1 | 26 | 52 | 8 | 6 | 1 |
| 4 | 100007 | 0 | Cash loans | M | N | Y | 0 | 121500.0 | 513000.0 | 21865.5 | 513000.0 | Unaccompanied | Working | Secondary | Single | House | 0.028663 | 1 | 1 | 0 | 1 | 0 | 0 | Core staff | 1 | 2 | 2 | THURSDAY | 11 | 0 | 0 | 0 | 0 | 1 | 1 | Religion | 0.322738 | 0.535276 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 3 | 11 | 54 | 8 | 9 | 1 |
application_data.shape
(281692, 54)
Data Visualization¶
Target column Analysis¶
As we can see below only 8.0 % of the people are the defaulters as compare to repayor's that is 92.0% this is an imbalanced data where we can see there only few defaulters¶
- 1 represents the defaulters who get difficulty to repay the loan
- 0 represents the repayer who was able to replay the loan
plt.pie(application_data['TARGET'].value_counts(), labels= application_data['TARGET'].value_counts().index, autopct='%1.1f%%')
plt.title("Distribustion of Target Variable")
plt.show()
Ratio = (application_data['TARGET']==0).sum() /(application_data['TARGET']== 1).sum()
Ratio
11.540266215554468
Binning Some columns¶
1. Age column¶
application_data['Age'] = pd.cut(application_data.AGE,[0,10,20,30,40,50,60,70,80,90,100], labels=['0-10','10-20','20-30','30-40','40-50','50-60','60-70','70-80','80-90','90-100'])
#application_data.drop(columns='AGE',inplace=True,axis= 1)
application_data.Age.value_counts()
Age 30-40 75076 40-50 66883 50-60 62511 20-30 49094 60-70 28127 10-20 1 0-10 0 70-80 0 80-90 0 90-100 0 Name: count, dtype: int64
2. Amt Income¶
application_data['AMT_INCOME_TOTAL'].describe()
count 281692.000000 mean 151664.187576 std 61891.551436 min 25650.000000 25% 112500.000000 50% 135000.000000 75% 185863.500000 max 337050.000000 Name: AMT_INCOME_TOTAL, dtype: float64
application_data['AMT_INCOME_GROUP'] = pd.cut(application_data['AMT_INCOME_TOTAL'],[25000, 30000, 40000, 50000, 60000, 70000, 80000, 90000, 100000, 200000, 300000, 400000], labels=['25k - 30k', '30k - 40k', '40k - 50k', '50k - 60k', '60k - 70k', '70k - 80k', '80k - 90k', '90k - 100k', '100k - 200k', '200k - 300k', '300k - 400k' ])
#application_data.drop(columns='AMT_INCOME_TOTAL', inplace=True, axis=1)
application_data.AMT_INCOME_GROUP.value_counts()
AMT_INCOME_GROUP 100k - 200k 151617 200k - 300k 61073 80k - 90k 30147 60k - 70k 12733 300k - 400k 6133 70k - 80k 5841 90k - 100k 5674 50k - 60k 3995 40k - 50k 3522 30k - 40k 818 25k - 30k 139 Name: count, dtype: int64
2. Amt Credit¶
application_data['AMT_CREDIT'].describe()
count 2.816920e+05 mean 5.536835e+05 std 3.418958e+05 min 4.500000e+04 25% 2.700000e+05 50% 4.950000e+05 75% 7.702920e+05 max 1.562076e+06 Name: AMT_CREDIT, dtype: float64
application_data['AMT_CREDIT_GROUP'] = pd.cut(application_data['AMT_CREDIT'], [45000, 270000, 513531, 808650, 1200000, 1600000, 2000000, 2500000, 4050000], labels=['45k-270k', '270k-513.5k', '513.5k-808.65k', '808.65k-1.2M', '1.2M-1.6M', '1.6M-2M', '2M-2.5M', '2.5M-4.05M'])
#application_data.drop(columns= 'AMT_CREDIT', inplace=True, axis= 1)
application_data.AMT_CREDIT_GROUP.value_counts()
AMT_CREDIT_GROUP 45k-270k 75968 513.5k-808.65k 73002 270k-513.5k 72055 808.65k-1.2M 43273 1.2M-1.6M 17170 1.6M-2M 0 2M-2.5M 0 2.5M-4.05M 0 Name: count, dtype: int64
Done with the Data cleaning now analyzing the columns¶
UNIVARIATE ANALYSIS¶
def plotting(data, column):
fig = plt.figure(figsize=(16,6))
ax1 = plt.subplot(1,2,1)
sns.countplot(data = data, palette='Set1',x = column, ax = ax1)
plt.title('Plotting data for the column: '+ column)
plt.xticks(rotation=90)
ax1.set_yscale("log")
ax2 = plt.subplot(1,2,2)
sns.countplot(data = data, palette='Set1',x = column, hue= 'TARGET', ax = ax2) # hue is used to differentiate the target variable
plt.title('Plotting data for the column: '+ column)
plt.xticks(rotation=90)
ax2.set_yscale("log")
plt.tight_layout() # Or equivalently, "plt.tight_layout()"
plt.show()
cat_col = list(set(application_data.columns) - set(application_data.describe().columns))
num_col = application_data.describe().columns
cat_col
['OCCUPATION_TYPE', 'NAME_EDUCATION_TYPE', 'Age', 'CODE_GENDER', 'NAME_TYPE_SUITE', 'FLAG_OWN_REALTY', 'NAME_HOUSING_TYPE', 'AMT_CREDIT_GROUP', 'AMT_INCOME_GROUP', 'WEEKDAY_APPR_PROCESS_START', 'ORGANIZATION_TYPE', 'NAME_INCOME_TYPE', 'NAME_FAMILY_STATUS', 'FLAG_OWN_CAR', 'NAME_CONTRACT_TYPE']
Plotting Categorical Columns¶
for column in cat_col:
plotting(application_data, column)
print('----------------------------------------------------------------------------------------------------------------------------------------------------------')
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------
Insights From Categorical column¶
1 NAME_CONTRACT_TYPE¶
client owns a house or flat¶
- The majority of the client own a house.flat thta is just a double who dont owe a flat
- As compare to target variable we see that there is no much difference between the owning the flat and loan defaulters it can seen that the people who dont owe a flat also considered as a defaulter
2. OCCUPATION TYPE¶
Client Occupation and loan defaulters¶
Count of the people taken loan¶
- The majority of people who has tajken a loan comes under "laborers" , "Others" (These are the clients whose occupation is not mentioned in the data) ,"Core staff", "Sales Staff"
- It and Hr Staff comes under small group who are taen the loan
Loan Defaulter¶
- Majority of the people who has taken the loan are comes under the loan defaulter
- We can also see to highest cateogry is "Loberers" and "Others" and "Drivers" ,"Sales Staff", "Cleaning Staff" who are not high skilled to there education
3. AGE¶
Count of the people taken loan¶
- We can see all the age group has taken loan
- 10 - 20 age group are very less who has taken the loan because majority of the people are students or and who has taken the loan has taken education loan who are above 18 years of the age group
Loan Defaulter¶
- Majority of the people who are under 30- 40 has done loan default
- As the age increses there are less chance to do the default in loan > 70 year of age
4. WEEkDAY_APP_PROCESS_START¶
Count of the people taken loan¶
- As we can see mostly all weeks clients has taken the loan except sunday because mostly sunday are holidays
Loan Defaulter¶
- As I cann see the plot with the default loan repay there nothing much we can see in the weekday_app_process_start column I can also drop this column from my analysis
5. AMT_CREDIT_GROUP¶
6. Name_Housing_Type¶
7. Name_Contract_Type¶
8. Name_Family_Status¶
9. NAME_INCOME_TYPE¶
Count of the people taken loan¶
- Most of client for loans have income type as Working, followed by Commercial associate, Pensioner and State servant
Loan Defaulter¶
- if we see the client who are on maternity leave has the highest possibility not to repay the loan, followed by unemployed.
- Students and businessman are not done in default ratio these are group which safer to provide the loan.
10. FLAG_OWN_CAR¶
11. AMT_INCOME_GROUP¶
12. NAME_EDUCATION_TYPE¶
13. CODE_GENDER¶
14. NAME_TYPE_SUTE¶
Count of the people taken loan¶
- Majority client are unaccompanied while applying for the loan and the clients were with their families and less amount of them were accompanied by their spouse.
- .
Loan Defaulter¶
- ALL of them have the same default rate, therefore the people who accompany the client while applying for the loan does not play a big role in deciding whether he/she defaults the loan.n
15. FLAG_OWN_REALTY¶
Count of the people taken loan¶
The majority of clients in the dataset own real estate. Clients who own real estate are more than double in number compared to those who don’t ow### n
Loan Defaulter
Based on the data, there doesn’t appear to be a significant correlation between owning real estate and loan defalting.
Plotting Numerical Columns¶
num_col
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'FLAG_MOBIL', 'FLAG_EMP_PHONE',
'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL',
'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY', 'EXT_SOURCE_2',
'EXT_SOURCE_3', 'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'YEAR_LAST_PHONE_CHANGE', 'YEAR_REGISTRATION', 'AGE', 'YEAR_EMPLOYED',
'YEAR_ID_PUBLISH', 'All_Flag_document'],
dtype='object')
correlation = application_data[num_col].corr()
correlation.TARGET.sort_values(ascending=True)
EXT_SOURCE_2 -0.156071 EXT_SOURCE_3 -0.140068 AGE -0.077633 YEAR_LAST_PHONE_CHANGE -0.056615 YEAR_ID_PUBLISH -0.050140 YEAR_EMPLOYED -0.048522 YEAR_REGISTRATION -0.043160 REGION_POPULATION_RELATIVE -0.032299 AMT_GOODS_PRICE -0.026982 FLAG_PHONE -0.024581 HOUR_APPR_PROCESS_START -0.023639 AMT_CREDIT -0.016544 AMT_REQ_CREDIT_BUREAU_MON -0.014527 AMT_INCOME_TOTAL -0.012079 AMT_REQ_CREDIT_BUREAU_QRT -0.008763 AMT_REQ_CREDIT_BUREAU_WEEK -0.002383 SK_ID_CURR -0.001632 FLAG_EMAIL -0.001489 AMT_REQ_CREDIT_BUREAU_HOUR -0.000324 FLAG_CONT_MOBILE -0.000271 FLAG_MOBIL 0.000555 AMT_REQ_CREDIT_BUREAU_DAY 0.002053 AMT_ANNUITY 0.002669 LIVE_REGION_NOT_WORK_REGION 0.005222 REG_REGION_NOT_LIVE_REGION 0.005793 OBS_60_CNT_SOCIAL_CIRCLE 0.008020 OBS_30_CNT_SOCIAL_CIRCLE 0.008181 REG_REGION_NOT_WORK_REGION 0.009107 AMT_REQ_CREDIT_BUREAU_YEAR 0.010842 CNT_FAM_MEMBERS 0.011666 All_Flag_document 0.018902 CNT_CHILDREN 0.021147 FLAG_WORK_PHONE 0.027574 DEF_60_CNT_SOCIAL_CIRCLE 0.031207 DEF_30_CNT_SOCIAL_CIRCLE 0.032278 LIVE_CITY_NOT_WORK_CITY 0.032832 REG_CITY_NOT_LIVE_CITY 0.044125 FLAG_EMP_PHONE 0.047510 REG_CITY_NOT_WORK_CITY 0.051325 REGION_RATING_CLIENT 0.055849 REGION_RATING_CLIENT_W_CITY 0.057660 TARGET 1.000000 Name: TARGET, dtype: float64
Lets first Drop irrelevant columns from the numerical data
flag_own_drop = []
for col in application_data:
if col.startswith("FLAG"):
flag_own_drop.append(col)
print(flag_own_drop)
['FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL']
- We can see EXT_SOURCE_2 and EXT_SOURCE_3 are negative correlation with target variable we can drop these two columns
- Need to Flag Columns mentioned above need to remove it
drop_columns = ['FLAG_OWN_CAR', 'FLAG_OWN_REALTY', 'FLAG_MOBIL', 'FLAG_EMP_PHONE', 'FLAG_WORK_PHONE', 'FLAG_CONT_MOBILE', 'FLAG_PHONE', 'FLAG_EMAIL','EXT_SOURCE_2','EXT_SOURCE_3']
application_data.drop(columns=drop_columns, axis =1 , inplace= True)
application_data.shape
(281692, 47)
num_col = application_data.describe().columns
num_col
Index(['SK_ID_CURR', 'TARGET', 'CNT_CHILDREN', 'AMT_INCOME_TOTAL',
'AMT_CREDIT', 'AMT_ANNUITY', 'AMT_GOODS_PRICE',
'REGION_POPULATION_RELATIVE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'HOUR_APPR_PROCESS_START',
'REG_REGION_NOT_LIVE_REGION', 'REG_REGION_NOT_WORK_REGION',
'LIVE_REGION_NOT_WORK_REGION', 'REG_CITY_NOT_LIVE_CITY',
'REG_CITY_NOT_WORK_CITY', 'LIVE_CITY_NOT_WORK_CITY',
'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'YEAR_LAST_PHONE_CHANGE', 'YEAR_REGISTRATION', 'AGE', 'YEAR_EMPLOYED',
'YEAR_ID_PUBLISH', 'All_Flag_document'],
dtype='object')
Lets do some univariate analysis on amt columns and cnt Family Members columns¶
num_col_anlysis = application_data[[ 'AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE','CNT_FAM_MEMBERS']]
Segregating the dataset on Target=0 and Target=1¶
train_0 = application_data.loc[application_data['TARGET'] == 0] # 0 - Loan was repaid
train_1 = application_data.loc[application_data['TARGET'] == 1] # 1 - Loan was not repaid
for column in num_col_anlysis:
title = "Plot of "+column
print(title)
plt.hist(train_0[column], alpha=0.5, label='0') # plotting a histogram for target 0 i.e. loan was repaid
plt.hist(train_1[column], alpha=0.5, label='1') # plotting a histogram for target 1 i.e. loan was not repaid
plt.legend()
plt.show()
sns.distplot(train_0[column], label='0') # plotting a distribution plot for target 0 i.e. loan was repaid
sns.distplot(train_1[column], label='1') # plotting a distribution plot for target 1 i.e. loan was not repaid
plt.legend()
plt.show()
print("------------------------------------------------------------------------")
Plot of AMT_INCOME_TOTAL
------------------------------------------------------------------------ Plot of AMT_CREDIT
------------------------------------------------------------------------ Plot of AMT_ANNUITY
------------------------------------------------------------------------ Plot of AMT_GOODS_PRICE
------------------------------------------------------------------------ Plot of CNT_FAM_MEMBERS
------------------------------------------------------------------------
Insights¶
I have considered dist plot to give the insights
AMT_INCOME_TOTAL¶
- The income type for both targeted type lied around 1000000 to 150000
AMT_CREDIT¶
- Most of the credit amount 25k to 50 k
AMT_ANNUITY¶
- Most client pay an annuity below 60K for their credit loans.
AMT_GOODS_PRICE¶
- The majority of loans are granted for goods prices below 1M.
CNT_FAM_MEMBERS¶
- the mojority of family members have size < 3 have difficult to repay the loan
Bivariate & Multivariate Analysis¶
# Create a pairplot with hue as 'TARGET'
sns.pairplot(data = application_data, vars = [ 'AMT_INCOME_TOTAL','AMT_CREDIT','AMT_ANNUITY', 'AMT_GOODS_PRICE'], hue = 'TARGET', diag_kind='kde')
<seaborn.axisgrid.PairGrid at 0x14f578efe00>
Insights From Analysis¶
AMT_CREDIT and AMT_GOODS_PRICEare highly corelated to each other as based on the scatterplot where most of the data are consolidated in form of a line There are very less defaulters for AMT_CREDIT > 1M 3There are very few defaulters for loan amounts exceeding 1M. This suggests that clients who borrow larger amount are less likely to default on their loans.ion
Previous Application ¶
previous_data = pd.read_csv("previous_application.csv")
previous_data.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_DOWN_PAYMENT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | RATE_DOWN_PAYMENT | RATE_INTEREST_PRIMARY | RATE_INTEREST_PRIVILEGED | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_TYPE_SUITE | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | DAYS_FIRST_DRAWING | DAYS_FIRST_DUE | DAYS_LAST_DUE_1ST_VERSION | DAYS_LAST_DUE | DAYS_TERMINATION | NFLAG_INSURED_ON_APPROVAL | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 0.0 | 17145.0 | SATURDAY | 15 | Y | 1 | 0.0 | 0.182832 | 0.867336 | XAP | Approved | -73 | Cash through the bank | XAP | NaN | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 365243.0 | -42.0 | 300.0 | -42.0 | -37.0 | 0.0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | NaN | 607500.0 | THURSDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -164 | XNA | XAP | Unaccompanied | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 365243.0 | -134.0 | 916.0 | 365243.0 | 365243.0 | 1.0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | NaN | 112500.0 | TUESDAY | 11 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -301 | Cash through the bank | XAP | Spouse, partner | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 365243.0 | -271.0 | 59.0 | 365243.0 | 365243.0 | 1.0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | NaN | 450000.0 | MONDAY | 7 | Y | 1 | NaN | NaN | NaN | XNA | Approved | -512 | Cash through the bank | XAP | NaN | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 365243.0 | -482.0 | -152.0 | -182.0 | -177.0 | 1.0 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | NaN | 337500.0 | THURSDAY | 9 | Y | 1 | NaN | NaN | NaN | Repairs | Refused | -781 | Cash through the bank | HC | NaN | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | NaN | NaN | NaN | NaN | NaN | NaN |
previous_data.shape
(1670214, 37)
previous_data.info()
<class 'pandas.core.frame.DataFrame'> RangeIndex: 1670214 entries, 0 to 1670213 Data columns (total 37 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 SK_ID_PREV 1670214 non-null int64 1 SK_ID_CURR 1670214 non-null int64 2 NAME_CONTRACT_TYPE 1670214 non-null object 3 AMT_ANNUITY 1297979 non-null float64 4 AMT_APPLICATION 1670214 non-null float64 5 AMT_CREDIT 1670213 non-null float64 6 AMT_DOWN_PAYMENT 774370 non-null float64 7 AMT_GOODS_PRICE 1284699 non-null float64 8 WEEKDAY_APPR_PROCESS_START 1670214 non-null object 9 HOUR_APPR_PROCESS_START 1670214 non-null int64 10 FLAG_LAST_APPL_PER_CONTRACT 1670214 non-null object 11 NFLAG_LAST_APPL_IN_DAY 1670214 non-null int64 12 RATE_DOWN_PAYMENT 774370 non-null float64 13 RATE_INTEREST_PRIMARY 5951 non-null float64 14 RATE_INTEREST_PRIVILEGED 5951 non-null float64 15 NAME_CASH_LOAN_PURPOSE 1670214 non-null object 16 NAME_CONTRACT_STATUS 1670214 non-null object 17 DAYS_DECISION 1670214 non-null int64 18 NAME_PAYMENT_TYPE 1670214 non-null object 19 CODE_REJECT_REASON 1670214 non-null object 20 NAME_TYPE_SUITE 849809 non-null object 21 NAME_CLIENT_TYPE 1670214 non-null object 22 NAME_GOODS_CATEGORY 1670214 non-null object 23 NAME_PORTFOLIO 1670214 non-null object 24 NAME_PRODUCT_TYPE 1670214 non-null object 25 CHANNEL_TYPE 1670214 non-null object 26 SELLERPLACE_AREA 1670214 non-null int64 27 NAME_SELLER_INDUSTRY 1670214 non-null object 28 CNT_PAYMENT 1297984 non-null float64 29 NAME_YIELD_GROUP 1670214 non-null object 30 PRODUCT_COMBINATION 1669868 non-null object 31 DAYS_FIRST_DRAWING 997149 non-null float64 32 DAYS_FIRST_DUE 997149 non-null float64 33 DAYS_LAST_DUE_1ST_VERSION 997149 non-null float64 34 DAYS_LAST_DUE 997149 non-null float64 35 DAYS_TERMINATION 997149 non-null float64 36 NFLAG_INSURED_ON_APPROVAL 997149 non-null float64 dtypes: float64(15), int64(6), object(16) memory usage: 471.5+ MB
# shows how much percentage of data are null
null_col = (previous_data.isnull().sum()/previous_data.shape[0]*100).sort_values(ascending = False)
null_col[null_col>0]
RATE_INTEREST_PRIVILEGED 99.643698 RATE_INTEREST_PRIMARY 99.643698 AMT_DOWN_PAYMENT 53.636480 RATE_DOWN_PAYMENT 53.636480 NAME_TYPE_SUITE 49.119754 NFLAG_INSURED_ON_APPROVAL 40.298129 DAYS_TERMINATION 40.298129 DAYS_LAST_DUE 40.298129 DAYS_LAST_DUE_1ST_VERSION 40.298129 DAYS_FIRST_DUE 40.298129 DAYS_FIRST_DRAWING 40.298129 AMT_GOODS_PRICE 23.081773 AMT_ANNUITY 22.286665 CNT_PAYMENT 22.286366 PRODUCT_COMBINATION 0.020716 AMT_CREDIT 0.000060 dtype: float64
# Lets See the columns who has more than 40 % of data with Null values
cut_off=40
cols_to_drop=list(previous_data.columns[100*previous_data.isnull().mean()>cut_off])
print(cols_to_drop)
['AMT_DOWN_PAYMENT', 'RATE_DOWN_PAYMENT', 'RATE_INTEREST_PRIMARY', 'RATE_INTEREST_PRIVILEGED', 'NAME_TYPE_SUITE', 'DAYS_FIRST_DRAWING', 'DAYS_FIRST_DUE', 'DAYS_LAST_DUE_1ST_VERSION', 'DAYS_LAST_DUE', 'DAYS_TERMINATION', 'NFLAG_INSURED_ON_APPROVAL']
len(cols_to_drop)
11
previous_data.drop(columns= cols_to_drop, axis= 1, inplace= True)
previous_data.shape
(1670214, 26)
# shows how much percentage of data are null
null_col = (previous_data.isnull().sum()/previous_data.shape[0]*100).sort_values(ascending = False)
null_col[null_col>0]
AMT_GOODS_PRICE 23.081773 AMT_ANNUITY 22.286665 CNT_PAYMENT 22.286366 PRODUCT_COMBINATION 0.020716 AMT_CREDIT 0.000060 dtype: float64
previous_data[['AMT_GOODS_PRICE','AMT_ANNUITY','CNT_PAYMENT','PRODUCT_COMBINATION','AMT_CREDIT']]
| AMT_GOODS_PRICE | AMT_ANNUITY | CNT_PAYMENT | PRODUCT_COMBINATION | AMT_CREDIT | |
|---|---|---|---|---|---|
| 0 | 17145.0 | 1730.430 | 12.0 | POS mobile with interest | 17145.0 |
| 1 | 607500.0 | 25188.615 | 36.0 | Cash X-Sell: low | 679671.0 |
| 2 | 112500.0 | 15060.735 | 12.0 | Cash X-Sell: high | 136444.5 |
| 3 | 450000.0 | 47041.335 | 12.0 | Cash X-Sell: middle | 470790.0 |
| 4 | 337500.0 | 31924.395 | 24.0 | Cash Street: high | 404055.0 |
| ... | ... | ... | ... | ... | ... |
| 1670209 | 267295.5 | 14704.290 | 30.0 | POS industry with interest | 311400.0 |
| 1670210 | 87750.0 | 6622.020 | 12.0 | POS industry with interest | 64291.5 |
| 1670211 | 105237.0 | 11520.855 | 10.0 | POS household with interest | 102523.5 |
| 1670212 | 180000.0 | 18821.520 | 12.0 | Cash X-Sell: low | 191880.0 |
| 1670213 | 360000.0 | 16431.300 | 48.0 | Cash X-Sell: middle | 360000.0 |
1670214 rows × 5 columns
Data Cleaning On Previous Application Data¶
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = previous_data.count()/len(previous_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
AMT_CREDIT 0.000060 PRODUCT_COMBINATION 0.020716 CNT_PAYMENT 22.286366 AMT_ANNUITY 22.286665 AMT_GOODS_PRICE 23.081773 dtype: float64
1. PRODUCT_COMBINATION¶
previous_data['PRODUCT_COMBINATION'].value_counts()
PRODUCT_COMBINATION Cash 285990 POS household with interest 263622 POS mobile with interest 220670 Cash X-Sell: middle 143883 Cash X-Sell: low 130248 Card Street 112582 POS industry with interest 98833 POS household without interest 82908 Card X-Sell 80582 Cash Street: high 59639 Cash X-Sell: high 59301 Cash Street: middle 34658 Cash Street: low 33834 POS mobile without interest 24082 POS other with interest 23879 POS industry without interest 12602 POS others without interest 2555 Name: count, dtype: int64
We change the production combination null values wih mode as percentage of null values are very less we can replace it with a mode
previous_data['PRODUCT_COMBINATION'].fillna(previous_data['PRODUCT_COMBINATION'].mode()[0], inplace= True)
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = previous_data.count()/len(previous_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
AMT_CREDIT 0.000060 CNT_PAYMENT 22.286366 AMT_ANNUITY 22.286665 AMT_GOODS_PRICE 23.081773 dtype: float64
Null_columns = previous_data[['AMT_GOODS_PRICE','AMT_ANNUITY','CNT_PAYMENT','AMT_CREDIT']]
Lets see the histogram fro all null columns and then fill the null values with approprimate values
for col in Null_columns:
sns.histplot(previous_data, x = col)
plt.show()
Seams all of them are right skewed its better to fill all null values with median
for col in Null_columns:
med = previous_data[col].median()
previous_data[col] = previous_data[col].fillna(med)
#pd.set_option("display.max_rows",None)
# displaying only columns having null values which is greater than 0
Null_columns = previous_data.count()/len(previous_data)
Null_columns = (1- Null_columns)*100
Null_columns[Null_columns>0].sort_values()
Series([], dtype: float64)
Previous Application Data¶
previous_data.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 17145.0 | SATURDAY | 15 | Y | 1 | XAP | Approved | -73 | Cash through the bank | XAP | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | 607500.0 | THURSDAY | 11 | Y | 1 | XNA | Approved | -164 | XNA | XAP | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | 112500.0 | TUESDAY | 11 | Y | 1 | XNA | Approved | -301 | Cash through the bank | XAP | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | 450000.0 | MONDAY | 7 | Y | 1 | XNA | Approved | -512 | Cash through the bank | XAP | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | 337500.0 | THURSDAY | 9 | Y | 1 | Repairs | Refused | -781 | Cash through the bank | HC | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high |
previous_data['NAME_CONTRACT_TYPE'].value_counts()
NAME_CONTRACT_TYPE Cash loans 747553 Consumer loans 729151 Revolving loans 193164 XNA 346 Name: count, dtype: int64
previous_data['NAME_CONTRACT_STATUS'].value_counts()
NAME_CONTRACT_STATUS Approved 1036781 Canceled 316319 Refused 290678 Unused offer 26436 Name: count, dtype: int64
previous_data['NAME_CASH_LOAN_PURPOSE'].value_counts()
NAME_CASH_LOAN_PURPOSE XAP 922661 XNA 677918 Repairs 23765 Other 15608 Urgent needs 8412 Buying a used car 2888 Building a house or an annex 2693 Everyday expenses 2416 Medicine 2174 Payments on other loans 1931 Education 1573 Journey 1239 Purchase of electronic equipment 1061 Buying a new car 1012 Wedding / gift / holiday 962 Buying a home 865 Car repairs 797 Furniture 749 Buying a holiday home / land 533 Business development 426 Gasification / water supply 300 Buying a garage 136 Hobby 55 Money for a third person 25 Refusal to name the goal 15 Name: count, dtype: int64
lets change days to year in DAYS_DECISION columns beacuse cant be negative and standardise into year
previous_data['YEAR_DECISION'] =previous_data['DAYS_DECISION'].abs()// 365
previous_data.head()
| SK_ID_PREV | SK_ID_CURR | NAME_CONTRACT_TYPE | AMT_ANNUITY | AMT_APPLICATION | AMT_CREDIT | AMT_GOODS_PRICE | WEEKDAY_APPR_PROCESS_START | HOUR_APPR_PROCESS_START | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | DAYS_DECISION | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | YEAR_DECISION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2030495 | 271877 | Consumer loans | 1730.430 | 17145.0 | 17145.0 | 17145.0 | SATURDAY | 15 | Y | 1 | XAP | Approved | -73 | Cash through the bank | XAP | Repeater | Mobile | POS | XNA | Country-wide | 35 | Connectivity | 12.0 | middle | POS mobile with interest | 0 |
| 1 | 2802425 | 108129 | Cash loans | 25188.615 | 607500.0 | 679671.0 | 607500.0 | THURSDAY | 11 | Y | 1 | XNA | Approved | -164 | XNA | XAP | Repeater | XNA | Cash | x-sell | Contact center | -1 | XNA | 36.0 | low_action | Cash X-Sell: low | 0 |
| 2 | 2523466 | 122040 | Cash loans | 15060.735 | 112500.0 | 136444.5 | 112500.0 | TUESDAY | 11 | Y | 1 | XNA | Approved | -301 | Cash through the bank | XAP | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | high | Cash X-Sell: high | 0 |
| 3 | 2819243 | 176158 | Cash loans | 47041.335 | 450000.0 | 470790.0 | 450000.0 | MONDAY | 7 | Y | 1 | XNA | Approved | -512 | Cash through the bank | XAP | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | middle | Cash X-Sell: middle | 1 |
| 4 | 1784265 | 202054 | Cash loans | 31924.395 | 337500.0 | 404055.0 | 337500.0 | THURSDAY | 9 | Y | 1 | Repairs | Refused | -781 | Cash through the bank | HC | Repeater | XNA | Cash | walk-in | Credit and cash offices | -1 | XNA | 24.0 | high | Cash Street: high | 2 |
previous_data.drop(columns= ['DAYS_DECISION'], axis =1 , inplace = True)
previous_data['NAME_YIELD_GROUP'].value_counts()
NAME_YIELD_GROUP XNA 517215 middle 385532 high 353331 low_normal 322095 low_action 92041 Name: count, dtype: int64
Data Visualization on Previous Application Data¶
preanalysis_col = ['NAME_CONTRACT_TYPE','NAME_CONTRACT_STATUS','NAME_CLIENT_TYPE']
def prev_app_analysis(data, columns):
plt.figure(figsize=(8, 6))
ax1 = plt.subplot(1,2,2)
previous_data[columns].value_counts().plot.pie(autopct = "%1.0f%%",ax = ax1)
plt.xlabel(columns)
plt.axis('equal')
plt.title('Distribution of' +" "+ columns)
plt.xticks(rotation=90)
plt.show()
for col in preanalysis_col:
prev_app_analysis(preanalysis_col, col)
Insights On Previous Application¶
1. Name Contract Type¶
- Cash loans: The most common contract type, accounting for 45% of all contracts.
- Consumer loans: A significant portion of contracts, making up 44% of the total.
- Revolving loans: Less common, representing 12% of contracts. Typically associated with credit cards or lines of credit.
- There is 0 % of XNA category
2. Name Contract Status¶
- The majority of loan applications were successfully approved ie 62% approved loan..
- Approximately 19% of applicants canceled their loan requests
- About 17% of applications werrefuseded, highlighting the importance of assessing rejection criteria and applicant profile
- A small percentage received offers but didn’t proceed, warranting investigation into factors influencing this decisiosks.
3. Name Client Type¶
- Repeater: 74% of clients fall into this category.
- New: Approximately 10% of clients are categorized as “New.
- Refreshed: About.1% of clients are classified as “Refreshed
- There is 0 % of XNA percentageA.”
Merging the application Data and previous application data¶
data_final = application_data.merge(previous_data, left_on='SK_ID_CURR',
right_on='SK_ID_CURR', how='inner')
data_final.head()
| SK_ID_CURR | TARGET | NAME_CONTRACT_TYPE_x | CODE_GENDER | CNT_CHILDREN | AMT_INCOME_TOTAL | AMT_CREDIT_x | AMT_ANNUITY_x | AMT_GOODS_PRICE_x | NAME_TYPE_SUITE | NAME_INCOME_TYPE | NAME_EDUCATION_TYPE | NAME_FAMILY_STATUS | NAME_HOUSING_TYPE | REGION_POPULATION_RELATIVE | OCCUPATION_TYPE | CNT_FAM_MEMBERS | REGION_RATING_CLIENT | REGION_RATING_CLIENT_W_CITY | WEEKDAY_APPR_PROCESS_START_x | HOUR_APPR_PROCESS_START_x | REG_REGION_NOT_LIVE_REGION | REG_REGION_NOT_WORK_REGION | LIVE_REGION_NOT_WORK_REGION | REG_CITY_NOT_LIVE_CITY | REG_CITY_NOT_WORK_CITY | LIVE_CITY_NOT_WORK_CITY | ORGANIZATION_TYPE | OBS_30_CNT_SOCIAL_CIRCLE | DEF_30_CNT_SOCIAL_CIRCLE | OBS_60_CNT_SOCIAL_CIRCLE | DEF_60_CNT_SOCIAL_CIRCLE | AMT_REQ_CREDIT_BUREAU_HOUR | AMT_REQ_CREDIT_BUREAU_DAY | AMT_REQ_CREDIT_BUREAU_WEEK | AMT_REQ_CREDIT_BUREAU_MON | AMT_REQ_CREDIT_BUREAU_QRT | AMT_REQ_CREDIT_BUREAU_YEAR | YEAR_LAST_PHONE_CHANGE | YEAR_REGISTRATION | AGE | YEAR_EMPLOYED | YEAR_ID_PUBLISH | All_Flag_document | Age | AMT_INCOME_GROUP | AMT_CREDIT_GROUP | SK_ID_PREV | NAME_CONTRACT_TYPE_y | AMT_ANNUITY_y | AMT_APPLICATION | AMT_CREDIT_y | AMT_GOODS_PRICE_y | WEEKDAY_APPR_PROCESS_START_y | HOUR_APPR_PROCESS_START_y | FLAG_LAST_APPL_PER_CONTRACT | NFLAG_LAST_APPL_IN_DAY | NAME_CASH_LOAN_PURPOSE | NAME_CONTRACT_STATUS | NAME_PAYMENT_TYPE | CODE_REJECT_REASON | NAME_CLIENT_TYPE | NAME_GOODS_CATEGORY | NAME_PORTFOLIO | NAME_PRODUCT_TYPE | CHANNEL_TYPE | SELLERPLACE_AREA | NAME_SELLER_INDUSTRY | CNT_PAYMENT | NAME_YIELD_GROUP | PRODUCT_COMBINATION | YEAR_DECISION | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 100002 | 1 | Cash loans | M | 0 | 202500.0 | 406597.5 | 24700.5 | 351000.0 | Unaccompanied | Working | Secondary | Single | House | 0.018801 | Laborers | 1 | 2 | 2 | WEDNESDAY | 10 | 0 | 0 | 0 | 0 | 0 | 0 | Business Entity Type 3 | 2.0 | 2.0 | 2.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 9 | 25 | 1 | 5 | 1 | 20-30 | 200k - 300k | 270k-513.5k | 1038818 | Consumer loans | 9251.775 | 179055.0 | 179055.0 | 179055.0 | SATURDAY | 9 | Y | 1 | XAP | Approved | XNA | XAP | New | Vehicles | POS | XNA | Stone | 500 | Auto technology | 24.0 | low_normal | POS other with interest | 1 |
| 1 | 100003 | 0 | Cash loans | F | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House | 0.003541 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 3 | 45 | 3 | 0 | 1 | 40-50 | 200k - 300k | 1.2M-1.6M | 1810518 | Cash loans | 98356.995 | 900000.0 | 1035882.0 | 900000.0 | FRIDAY | 12 | Y | 1 | XNA | Approved | XNA | XAP | Repeater | XNA | Cash | x-sell | Credit and cash offices | -1 | XNA | 12.0 | low_normal | Cash X-Sell: low | 2 |
| 2 | 100003 | 0 | Cash loans | F | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House | 0.003541 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 3 | 45 | 3 | 0 | 1 | 40-50 | 200k - 300k | 1.2M-1.6M | 2636178 | Consumer loans | 64567.665 | 337500.0 | 348637.5 | 337500.0 | SUNDAY | 17 | Y | 1 | XAP | Approved | Cash through the bank | XAP | Refreshed | Furniture | POS | XNA | Stone | 1400 | Furniture | 6.0 | middle | POS industry with interest | 2 |
| 3 | 100003 | 0 | Cash loans | F | 0 | 270000.0 | 1293502.5 | 35698.5 | 1129500.0 | Family | State servant | Higher education | Married | House | 0.003541 | Core staff | 2 | 1 | 1 | MONDAY | 11 | 0 | 0 | 0 | 0 | 0 | 0 | School | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 3 | 45 | 3 | 0 | 1 | 40-50 | 200k - 300k | 1.2M-1.6M | 2396755 | Consumer loans | 6737.310 | 68809.5 | 68053.5 | 68809.5 | SATURDAY | 15 | Y | 1 | XAP | Approved | Cash through the bank | XAP | Refreshed | Consumer Electronics | POS | XNA | Country-wide | 200 | Consumer electronics | 12.0 | middle | POS household with interest | 6 |
| 4 | 100004 | 0 | Revolving loans | M | 0 | 67500.0 | 135000.0 | 6750.0 | 135000.0 | Unaccompanied | Working | Secondary | Single | House | 0.010032 | Laborers | 1 | 2 | 2 | MONDAY | 9 | 0 | 0 | 0 | 0 | 0 | 0 | Government | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2 | 11 | 52 | 0 | 6 | 0 | 50-60 | 60k - 70k | 45k-270k | 1564014 | Consumer loans | 5357.250 | 24282.0 | 20106.0 | 24282.0 | FRIDAY | 5 | Y | 1 | XAP | Approved | Cash through the bank | XAP | New | Mobile | POS | XNA | Regional / Local | 30 | Connectivity | 4.0 | middle | POS mobile without interest | 2 |
data_final.shape
(1292479, 72)
data_final.columns.values
array(['SK_ID_CURR', 'TARGET', 'NAME_CONTRACT_TYPE_x', 'CODE_GENDER',
'CNT_CHILDREN', 'AMT_INCOME_TOTAL', 'AMT_CREDIT_x',
'AMT_ANNUITY_x', 'AMT_GOODS_PRICE_x', 'NAME_TYPE_SUITE',
'NAME_INCOME_TYPE', 'NAME_EDUCATION_TYPE', 'NAME_FAMILY_STATUS',
'NAME_HOUSING_TYPE', 'REGION_POPULATION_RELATIVE',
'OCCUPATION_TYPE', 'CNT_FAM_MEMBERS', 'REGION_RATING_CLIENT',
'REGION_RATING_CLIENT_W_CITY', 'WEEKDAY_APPR_PROCESS_START_x',
'HOUR_APPR_PROCESS_START_x', 'REG_REGION_NOT_LIVE_REGION',
'REG_REGION_NOT_WORK_REGION', 'LIVE_REGION_NOT_WORK_REGION',
'REG_CITY_NOT_LIVE_CITY', 'REG_CITY_NOT_WORK_CITY',
'LIVE_CITY_NOT_WORK_CITY', 'ORGANIZATION_TYPE',
'OBS_30_CNT_SOCIAL_CIRCLE', 'DEF_30_CNT_SOCIAL_CIRCLE',
'OBS_60_CNT_SOCIAL_CIRCLE', 'DEF_60_CNT_SOCIAL_CIRCLE',
'AMT_REQ_CREDIT_BUREAU_HOUR', 'AMT_REQ_CREDIT_BUREAU_DAY',
'AMT_REQ_CREDIT_BUREAU_WEEK', 'AMT_REQ_CREDIT_BUREAU_MON',
'AMT_REQ_CREDIT_BUREAU_QRT', 'AMT_REQ_CREDIT_BUREAU_YEAR',
'YEAR_LAST_PHONE_CHANGE', 'YEAR_REGISTRATION', 'AGE',
'YEAR_EMPLOYED', 'YEAR_ID_PUBLISH', 'All_Flag_document', 'Age',
'AMT_INCOME_GROUP', 'AMT_CREDIT_GROUP', 'SK_ID_PREV',
'NAME_CONTRACT_TYPE_y', 'AMT_ANNUITY_y', 'AMT_APPLICATION',
'AMT_CREDIT_y', 'AMT_GOODS_PRICE_y',
'WEEKDAY_APPR_PROCESS_START_y', 'HOUR_APPR_PROCESS_START_y',
'FLAG_LAST_APPL_PER_CONTRACT', 'NFLAG_LAST_APPL_IN_DAY',
'NAME_CASH_LOAN_PURPOSE', 'NAME_CONTRACT_STATUS',
'NAME_PAYMENT_TYPE', 'CODE_REJECT_REASON', 'NAME_CLIENT_TYPE',
'NAME_GOODS_CATEGORY', 'NAME_PORTFOLIO', 'NAME_PRODUCT_TYPE',
'CHANNEL_TYPE', 'SELLERPLACE_AREA', 'NAME_SELLER_INDUSTRY',
'CNT_PAYMENT', 'NAME_YIELD_GROUP', 'PRODUCT_COMBINATION',
'YEAR_DECISION'], dtype=object)
train_0 = application_data.loc[application_data['TARGET'] == 0] # train_0 - It contains the data where the loan was repaid
train_1 = application_data.loc[application_data['TARGET'] == 1] # train_1 - It contains the data where the loan was not repaid
ptrain_0 = data_final.loc[data_final['TARGET'] == 0] # ptrain_0 - It contains the data where the loan was repaid
ptrain_1 = data_final.loc[data_final['TARGET'] == 1] # ptrain_1 - It contains the data where the loan was not repaid
def plotting(column, hue):
col = column
hue = hue
fig = plt.figure(figsize=(13,10))
ax1 = plt.subplot(221)
application_data[col].value_counts().plot.pie(autopct = "%1.0f%%", ax=ax1) # autopct is used to show the percentage on the pie chart
plt.title('Plotting data for the column: '+ column)
ax2 = plt.subplot(222)
df = pd.DataFrame()
df['0']= ((train_0[col].value_counts())/len(train_0))
df['1']= ((train_1[col].value_counts())/len(train_1))
df.plot.bar(ax=ax2) # plotting the bar chart
plt.title('Plotting data for target in terms of total count')
ax3 = plt.subplot(223)
sns.countplot(x=col, hue=hue, data=ptrain_0, ax = ax3) # plotting the countplot
plt.xticks(rotation=90)
plt.title('Plotting data for Target=0 in terms of percentage')
ax4 = plt.subplot(224)
sns.countplot(x=col, hue=hue, data=ptrain_1, ax = ax4) # plotting the countplot
plt.xticks(rotation=90)
plt.title('Plotting data for Target=1 in terms of percentage')
fig.tight_layout() # Or equivalently, "plt.tight_layout()"
plt.show()
plotting('NAME_EDUCATION_TYPE','NAME_CONTRACT_STATUS')
plotting('Age','NAME_EDUCATION_TYPE')
INSIGHTS¶
NAME_EDUCATION_TYPE VS NAME_CONTRACT_STATUS¶
As we can see higher education and secondary education gives more approved loans and as able repay the loan as well There is less defaulter percentage if we give a loan to higher education people
Age VS NAME_EDUCATION_TYPE¶
There is a mjority of perople who are under age of 30- 40 year of age who has repay the loan we can target the people under the age of 30 to 60 year of age groups who are safe to give the loan and higher chances they repay the loan as well with this particular data We have to take care of clients who has not done higher education or whoes education are incomplete the can be defaulter
def pointplotdata(column1, column2):
col1 = column1
col2 = column2
fig = plt.figure(figsize=(13,10))
ax1 = plt.subplot(221)
sns.pointplot(data=data_final, x=col1, y=col2, hue='TARGET')
plt.xticks(rotation=90)
plt.title('Plotting data for the TARGET column vs : '+ column2 + " " +'By'+ " "+ column1)
pointplotdata('NAME_CONTRACT_STATUS','AMT_INCOME_TOTAL')
INSIGHTS¶
we can see here the people whoes income is higher than others have not used the offer we can target these kind of people as well
pointplotdata('NAME_CONTRACT_STATUS','DEF_60_CNT_SOCIAL_CIRCLE')
INSIGHTS¶
Clients who have average of 0.13 or higher DEF_60_CNT_SOCIAL_CIRCLE score tend to default more and hence client's social circle has to be analysed before providing the loan.
pointplotdata('NAME_HOUSING_TYPE','AMT_CREDIT_x')
INSIGHTS¶
the People who are leaving in co-op apartment and office apartment has higher chances to be defaulter
Conclusion¶
Analysing the whole dataset we can consider the follwing thing that would help bank to provide the loan who can repay the or not
Successful applicant will be Repayer:¶
- AGE : - Loan applicants above the age of 60 has a lower tendency to default.
- NAME_INCOME_TYPE :- Student and Businessmen have no defaults.
- AMT_INCOME_TOTAL :- Applicants with incomes exceeding 10 Millions experience a lower likelihood of default.
- NAME_HOUSING_TYPE :- Applicant leaving in there own house, or parents , or rental apartment are lower likelihood of default
- CNT_CHILDREN:- People with zero to two children tend to repay the loans.
applicant will be Defaulter:¶
- NAME_EDUCATION_TYPE: People with Lower Secondary & Secondary education has higher defaulter
- CNT_CHILDREN: Client who have children equal to or more than 9 default 100% and hence their applications are to be rejected.
- CODE_GENDER: Men are at relatively higher default rate
- NAME_FAMILY_STATUS : People who have civil marriage or who are single default a lot.
- OCCUPATION_TYPE: Avoid Low-skill Laborers, Drivers and Waiters staff, Security staff, Laborers and Cooking staff has the higher default rate .
- DAYS_BIRTH: Avoid young people who are in age group of 20-30 as they have higher probability of defaulting
- AMT_GOODS_PRICE: When the credit amount goes beyond 3M, there is an increase in defaulters.